We address the problems of multi-domain and single-domain regression based on distinct and unpaired
labeled training sets for each of the domains and a large unlabeled training set from all domains. We
formulate these problems as a Bayesian estimation with partial knowledge of statistical relations. We
propose a worst-case design strategy and study the resulting estimators. Our analysis explicitly accounts
for the cardinality of the labeled sets and includes the special cases in which one of the labeled sets is
very large or, in the other extreme, completely missing. We demonstrate our estimators in the context
of removing expressions from facial images and in the context of audio-visual word recognition, and
provide comparisons to several recently proposed multi-modal learning algorithms.
1. Introduction

There are many applications in which one can access data from multiple domains in order to perform a 
task. For example, word recognition can greatly benefit from the availability of joint audio-visual mea- 
surements [17]. Person recognition and verification can be performed much more accurately by fusing 
information from several modalities such as facial images, iris scans, voice recordings, and handwrit- 
ings. 

A major difficulty in fusing multiple sources is that one can often access only distinct labeled training 
sets for the different domains and does not have paired labeled examples from all domains. Suppose, for 
instance, we wish to perform audio-visual gender recognition. There are numerous existing data-sets 
of labeled voice recordings as well as labeled data-sets of facial images. However, there are only a few 
jointly labeled audio-visual data-sets, with a limited number of different subjects each. Thus, although it 
is straightforward to train a classifier based on audio or image data alone, it is not clear how to best fuse
the two modalities, in particular when they are unpaired. While paired multi-domain labeled examples
are typically scarce, paired unlabeled examples are often abundant. For instance, enormous amounts of 
speaker video sequences (together with audio) can be easily collected. These videos, though, often do 
not come with labels. Nonetheless, they can be used to unveil the statistical relations between audio 
and video. An important question is how to best fuse audio- and image-based predictors, given these 
relations. 

An even more interesting and practical question is whether the availability of multiple data sources 
can aid a machine learning algorithm during training, even if not all are measured during testing. For 
example, suppose we want to predict the age of a person based on an audio recording of him/her. Assume
we have a labeled audio training set, a labeled image training set, and a large amount of unlabeled audio- 
visual examples. Can the visual examples help construct a predictor, which is solely based on audio? 

In this paper we address the problem of multi-domain as well as single-domain regression based on 
distinct (unpaired) labeled training sets for each of the domains and an unlabeled multi-domain training 
set. Specifically, focusing on two domains for simplicity, we consider the situation in which we have at
our disposal a very large unlabeled training set $\{x_1^u, x_2^u\}$ as well as two labeled sets $\{x_1^\ell, y^\ell\}$ and $\{x_2^\ell, y^\ell\}$.
Using this multi-domain training data, we treat the problems of designing a predictor of $y$ based on
$(x_1, x_2)$ (multi-domain regression) and a predictor of $y$ based on $x_1$ alone (single-domain regression).
Our analysis is general in that it explicitly accounts for the cardinality of the labeled sets. In particular, 
it includes the special cases in which one or both labeled sets are very large as well as the cases in which 
one of the labeled sets is completely missing. 

Several problems of similar nature have been treated in the literature. Perhaps the most widely stud- 
ied of these is multi-view learning [2] in general and multi-view regression [10] in particular. These 
techniques make use of a large training set of data from multiple domains (views), containing only a 
few labeled examples. It has been shown that if the views tend to agree in some sense, then the unla- 
beled examples are useful in constructing a single-view estimator [2, 10]. In our setting, however, we 
do not observe even a single multi-domain labeled example $\{x_1^\ell, x_2^\ell, y^\ell\}$ and also make no assumptions
on the underlying distribution. A multi-view framework for distinct labeled training sets, recently pro- 
posed in [1], assumes the availability of a mapping function which can generate a good estimate of 
the unobserved view from the observed one. In our setting, we do not assume that such a mapping is 
known or even exists. These distinctions have profound implications. In particular, the lack of labeled 
multi-domain samples in our scenario implies that, even if our single-domain sets are infinite, we may 
only be able to deduce the joint distribution of $(x_1, x_2)$, of $(x_1, y)$, and of $(x_2, y)$. This, however, does
not suffice, in general, to determine the conditional distribution of $y$ given $(x_1, x_2)$, and therefore, for
instance, the minimum mean-square error (MMSE) estimator $\rho(x_1, x_2) = \mathbb{E}[Y \mid X_1 = x_1, X_2 = x_2]$ cannot
be constructed. 

Situations in which labeled samples $\{x_2^\ell, y^\ell\}$ from a source domain are used to construct a predictor
of $y$ from a target domain $x_1$ fall under the category of transfer learning [18]. In some cases, unlabeled
examples, as well as a few labeled examples $\{x_1^\ell, y^\ell\}$ from the target domain are also available.
Traditional transfer learning algorithms are suited for domains admitting a common feature representation.
For example, the different domains may be images of an object taken from different views, in which 
case the extracted features are of the same type. Extension to different representations may be handled 
via the multiple-outlook learning framework [9]. Nevertheless, in both these settings paired unlabeled 
examples $\{x_1^u, x_2^u\}$ from the two domains are not accessible. In this sense, our setting allows learning
via supervised-transfer of knowledge. 

More related to our problem are the cross-modality and shared-representation learning scenarios 
recently studied in [17] in the context of multi-modal learning. In both settings, unlabeled training data 
$\{x_1^u, x_2^u\}$ from multiple modalities, such as audio and video, are used to perform a feature learning stage.
In cross-modality learning, one constructs a predictor based on $x_1$ alone using a labeled training set
$\{x_1^\ell, y^\ell\}$. For example, we may want to build a classifier operating on audio features by observing labeled
audio examples in addition to unlabeled audio-visual instances. In shared-representation learning, one
constructs a predictor based on $x_1$ alone using a labeled training set $\{x_2^\ell, y^\ell\}$. For instance, we may want
to train an audio classifier by observing only labeled visual examples in addition to unlabeled audio- 
visual instances. Cross-modality regression was recently studied from a Bayesian estimation perspective 
in [15], in which a link to instrumental variable regression [3] was also highlighted. As we show, both 
cross-modality and shared-representation learning are special cases of our approach, corresponding to 
the situation in which there are zero examples in one of the labeled sets. 

In this paper we formulate regression from unpaired data sets as a Bayesian estimation problem 
with partial knowledge of statistical relations. Specifically, we assume that, for each domain, we can 
determine the predictor that minimizes the mean square error (MSE) among some class of estimators.
This can be done using the labeled training examples from the associated domain. Furthermore, we 
assume that we can determine the joint probability distribution of the data from the two domains using 
the unlabeled examples. Now, every joint distribution of labels and (multi-domain) data which is con- 
sistent with this knowledge is considered valid. The performance of any estimator depends, of course, 
on the unknown distribution. Thus, our approach in this paper is to seek estimators whose worst-case 
MSE over the set of valid distributions is the smallest possible.

We show that the minimax problems we obtain have simple, yet nontrivial, closed form solutions 
which can be easily approximated from the available training examples. These expressions also provide 
insight into how data from multiple domains should be taken into account. In particular, we show that, 
from a worst-case standpoint, a domain with no labeled examples cannot help. Thus, it is impossible 
to perform cross-modality regression without making any assumptions on the underlying distributions. 
We illustrate our approach in the contexts of face normalization and audio-visual word recognition. In 
the former application, we demonstrate how an image of a smiling face can be converted into one with a 
neutral expression, without observing paired examples of neutral and smiling faces. In the latter setting, 
we show how spoken digits can be recognized from silent video (lipreading) when only labeled audio 
examples are available. We also show how they can be recognized from audio, when there is access 
only to labeled video examples. The experiments indicate that our approach is preferable to that of [17]. 

The remainder of this paper is organized as follows. In Section 2 we present the setting of interest in 
detail and discuss several special cases. We provide a mathematical formulation of our regression
problems in Section 3. The minimax multi-domain and single-domain estimators are derived in Sections 4
and 5, respectively. Finally, experimental results are provided in Section 6. 



2. Problem Formulation 

We denote random variables (RVs) by capital letters (e.g., $X_1, X_2, Y$) and the values that they take by bold
lower-case letters (e.g., $x_1, x_2, y$). The pseudo-inverse of a matrix $A$ is denoted by $A^{\dagger}$. The second-order
moment matrix of an RV $X$ is denoted by $\Gamma_{XX} = \mathbb{E}[XX^T]$, where $\mathbb{E}[\cdot]$ is the mathematical expectation
operator. Similarly, the cross second-order moment matrix of two RVs $X$ and $Y$ is denoted by $\Gamma_{XY} =
\mathbb{E}[XY^T]$. The joint cumulative distribution function of the RVs $X$ and $Y$ is written $F_{XY}(x, y) = \mathbb{P}(X \le
x, Y \le y)$, where the inequalities are element-wise. By definition, the marginal distribution of $X$ is
$F_X(x) = F_{XY}(x, \infty)$. In our setting, $Y$ is the quantity to be estimated, and $X_1$ and $X_2$ are two sets of
measurements (features). The RVs $X_1$, $X_2$, and $Y$ take values in $\mathbb{R}^{M_1}$, $\mathbb{R}^{M_2}$, and $\mathbb{R}^{N}$, respectively.

Our goal in this paper is to propose an estimation theoretic approach for solving certain regression 






MICHAELI, ELDAR AND SAPIRO 




Fig. 1: Multi-domain regression. (a),(b) Single-domain training with many/few labeled examples
(Section 4.1). (c) Multi-domain training with few labeled examples (Sections 4.2 and 4.3). (d) Multi-domain
training with many unpaired labeled examples from one domain and few from the other domain
(Sections 4.4 and 4.5).



problems in which several distinct training sets are available during training. More specifically, we 
assume we are given access to three possible data-sets as follows: 

1. labeled examples $\{(x_1^\ell, y^\ell)\}_{\ell=1}^{L_1}$ from domain 1;

2. labeled examples $\{(x_2^\ell, y^\ell)\}_{\ell=L_1+1}^{L_1+L_2}$ from domain 2;

3. paired unlabeled examples $\{(x_1^u, x_2^u)\}_{u=L_1+L_2+1}^{L_1+L_2+U}$.

These training sets correspond to independent draws from the distributions $F_{X_1Y}$, $F_{X_2Y}$ and $F_{X_1X_2}$,
respectively. Our focus is on situations in which $U$ is very large, so that the joint distribution $F_{X_1X_2}$ can be
assumed known (or very well approximated, for example, by nonparametric methods). The cardinalities
$L_1$ and $L_2$ of the labeled sets are arbitrary. In particular, one of them can be zero. In this case no
knowledge whatsoever is available regarding the statistical relation between $Y$ and the associated domain. On
the other extreme, one (or both) of the labeled sets may be very large, in which case the associated
single-domain MMSE estimator, say $\mathbb{E}[Y \mid X_1]$, can be assumed known (or accurately approximated).

In terms of testing, we treat two tasks. The first is multi-domain regression, in which the algorithm is
asked to predict $y$ based on an observation of $x_1$ and $x_2$. The second is single-domain regression, where
prediction should be based solely on $x_1$ (including the case where no $x_1$ labeled data is available for
training, that is, $L_1 = 0$). Several archetypical situations are depicted in Figs. 1 and 2. Here, single- and
double-lined circles correspond, respectively, to RVs that are unobserved and observed during testing.
A continuous line, a dashed line, and lack of a line between circles correspond, respectively, to many,
few and zero training examples.

3. Estimation Theoretic Formulation 

In this paper we adopt and generalize the framework proposed in [15] by posing our problem as one of 
estimation with partial knowledge of statistical relations. Before formalizing our multi-domain semi- 
supervised problem in estimation theoretic terms, we first recall the common practice for regression 
from one domain with a limited number of examples. 

3.1 Single-Domain Regression

Suppose we are given a sample $\{x^\ell, y^\ell\}_{\ell=1}^{L}$, $x \in \mathbb{R}^{M}$, independently drawn from the joint distribution
of $X$ and $Y$. If $L$ is very large, then nonparametric methods can be used to approximate the conditional
Fig. 2: Single-domain regression. (a),(b) Cross-domain learning [17] with many/few labeled examples
(Section 5.1). (c),(d) Shared-representation regression [17], also referred to as estimation with partial 
knowledge [15], with many/few labeled examples (Section 5.2). (e),(f) Multi-domain training with 
many/few labeled examples from the unobserved domain (Section 5.3). 



expectation estimator $\phi(x) = \mathbb{E}[Y \mid X = x]$ with great accuracy at any $x$. Such estimates, however, are
often far from accurate when $L$ is small. Common practice in such situations is to use parametric or
semi-parametric methods that impose some structure on the sought predictor. In other words, rather
than trying to approximate the regression function $\phi(x) = \mathbb{E}[Y \mid X = x]$, which minimizes the mean
square error among all functions of $X$, we settle for approximating the optimal predictor among some
family $\mathcal{A}$ of functions:



$$\phi_{\mathcal{A}} = \arg\min_{\phi \in \mathcal{A}} \mathbb{E}\left[\|Y - \phi(X)\|^2\right]. \tag{3.1}$$



The less rich the class $\mathcal{A}$ is, the more accurately we can typically approximate $\phi_{\mathcal{A}}(X)$ from the training
data. This comes, of course, at the cost that the (theoretical) MSE that $\phi_{\mathcal{A}}(X)$ achieves is higher.
This is the well known bias-variance tradeoff. In the sequel, we term the function $\phi_{\mathcal{A}}(X)$ of (3.1) the
$\mathcal{A}$-optimal estimator of $Y$ from $X$.

One of the simplest structural restrictions corresponds to linear estimation, so that $\mathcal{A}$ is the set of all
linear functions from $\mathbb{R}^{M}$ to $\mathbb{R}^{N}$. In this case,



$$\phi_{\mathcal{A}}(x) = \Gamma_{YX} \Gamma_{XX}^{\dagger} x. \tag{3.2}$$



The second-order moment matrices $\Gamma_{YX}$, $\Gamma_{XX}$ can be estimated from the training set, for example, by
using sample moments. A more general model corresponds to functions of the form



$$\phi(X) = \sum_{k=1}^{K} a_k \phi_k(X), \tag{3.3}$$
where $\{\phi_k\}_{k=1}^{K}$ is a predefined set of functions and the coefficients $\{a_k\}_{k=1}^{K}$ are arbitrary. The optimal
set of coefficients $a = (a_1 \cdots a_K)^T$ is given in this case by

$$a = \Gamma_{\Phi\Phi}^{\dagger} \Gamma_{\Phi Y}, \tag{3.4}$$

where $\Gamma_{\Phi\Phi}$ denotes the $K \times K$ matrix whose $(i,j)$-th entry is $\mathbb{E}[\phi_i^T(X)\phi_j(X)]$ and $\Gamma_{\Phi Y}$ is a $K \times 1$ vector
whose $i$-th component is $\mathbb{E}[\phi_i^T(X)Y]$. These quantities can be estimated from the training data similar to
the linear setting.
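
As a hedged illustration of how the coefficients in (3.4) might be estimated from sample moments, the following minimal sketch fits a two-function basis to synthetic scalar data. The basis $\phi_1(x) = x$, $\phi_2(x) = x^2$, the data model, the sample size and the helper name `fit_two_basis` are all assumptions made for this example and do not come from the paper. The sketch also uses a plain inverse (Cramer's rule) in place of the pseudo-inverse, which is only valid when the moment matrix is nonsingular.

```python
import random


def fit_two_basis(xs, ys):
    """Sample-moment estimate of a = Gamma_PhiPhi^{-1} Gamma_PhiY
    for the assumed basis phi_1(x) = x, phi_2(x) = x**2 (scalar X and Y)."""
    n = len(xs)
    # Entries of the 2x2 moment matrix E[phi_i(X) phi_j(X)].
    g11 = sum(x * x for x in xs) / n
    g12 = sum(x ** 3 for x in xs) / n
    g22 = sum(x ** 4 for x in xs) / n
    # Entries of the cross-moment vector E[phi_i(X) Y].
    r1 = sum(x * y for x, y in zip(xs, ys)) / n
    r2 = sum(x * x * y for x, y in zip(xs, ys)) / n
    # Solve the 2x2 system by Cramer's rule (assumes a nonsingular matrix).
    det = g11 * g22 - g12 * g12
    a1 = (r1 * g22 - g12 * r2) / det
    a2 = (g11 * r2 - g12 * r1) / det
    return a1, a2


# Synthetic data (illustrative): Y = 1.5 X - 0.7 X^2 + noise, X standard normal.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(20000)]
ys = [1.5 * x - 0.7 * x * x + rng.gauss(0.0, 0.5) for x in xs]
a1, a2 = fit_two_basis(xs, ys)
print(a1, a2)  # should land near the generating coefficients 1.5 and -0.7
```

Because the generating model lies in the span of the chosen basis, the estimated coefficients approach the generating ones as the sample grows; with a poorer basis the same procedure would return the best approximation within that family instead.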

In both examples above, $\mathcal{A}$ forms a linear subspace of functions: for every $\phi^1, \phi^2 \in \mathcal{A}$ and $\alpha, \beta \in \mathbb{R}$,
the function $\alpha\phi^1 + \beta\phi^2$ also belongs to $\mathcal{A}$. For future reference, we note that this claim is also trivially
true when $\mathcal{A}$ is taken to be the set of all (Borel-measurable) functions, in which case $\phi_{\mathcal{A}}(X) = \mathbb{E}[Y \mid X]$,
and when $\mathcal{A}$ contains only the zero function, in which case $\phi_{\mathcal{A}}(X) = 0$.

3.2 Statistical Knowledge Deduced from Separate Training Sets 

In our setting we have access to two separate unpaired sets of labeled examples, one for each domain.
Consequently, besides the standard uncertainty in statistics, which has to do with the fact that the
underlying distributions are not known but rather only samples are observed, here there is another degree of
uncertainty. Specifically, even if the number of training examples is taken to infinity in all three sets, we
can only hope to be able to determine the joint distributions $F_{X_1Y}$, $F_{X_2Y}$ and $F_{X_1X_2}$. These do not suffice in
general for computing the MMSE estimate $\mathbb{E}[Y \mid X_1, X_2]$. To focus only on the second type of uncertainty,
we assume that we are able to perform single-domain regression from each of the training sets with very
small variance (at the expense of possible bias). Specifically, we assume that we can determine the
$\mathcal{A}$-optimal predictor of $Y$ given $X_1$ as well as the $\mathcal{B}$-optimal predictor of $Y$ from $X_2$, where $\mathcal{A}$ and $\mathcal{B}$
are classes of functions chosen in accordance with the cardinality of the two sets. Note that each of the
single-domain predictors may be very poor. In particular, if there are no labeled training examples from
one of the domains then we choose the corresponding class of valid predictors to contain only the zero
function. Therefore, if, for instance, we have $L_1 = 0$ labeled examples from domain $X_1$, then we set
$\mathcal{A} = \{0\}$ so that the $\mathcal{A}$-optimal predictor of $Y$ given $X_1$ is simply $\phi_{\mathcal{A}}(X_1) = 0$.

We further assume that the existence of many unlabeled examples $(X_1, X_2)$ allows accurately
determining the joint distribution of $X_1$ and $X_2$, for example, using nonparametric methods. Finally, we
assume that there are enough labeled examples from at least one of the domains such that the second- 
order moment of Y can be accurately estimated. The statistical relationships assumed known are 
depicted in Fig. 3. 

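In a more mathematical language, assume we are given two functions $\phi_{\mathcal{A}} : \mathbb{R}^{M_1} \to \mathbb{R}^{N}$ and $\psi_{\mathcal{B}} :
\mathbb{R}^{M_2} \to \mathbb{R}^{N}$, a cumulative probability function $F_{X_1X_2}$ over $\mathbb{R}^{M_1} \times \mathbb{R}^{M_2}$ and a scalar $c > 0$. Then, what we
know regarding the RVs $X_1$, $X_2$ and $Y$ is that their distribution $F_{X_1X_2Y}$ belongs to the set $\mathcal{F}$ of distributions
satisfying

$$\begin{aligned}
\phi_{\mathcal{A}} &= \arg\min_{\phi \in \mathcal{A}} \mathbb{E}\left[\|Y - \phi(X_1)\|^2\right], &
\psi_{\mathcal{B}} &= \arg\min_{\psi \in \mathcal{B}} \mathbb{E}\left[\|Y - \psi(X_2)\|^2\right], \\
F_{X_1X_2Y}(x_1, x_2, \infty) &= F_{X_1X_2}(x_1, x_2), &
\mathbb{E}\left[\|Y\|^2\right] &= c.
\end{aligned} \tag{3.5}$$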

We assume throughout the paper that $\mathcal{A}$ and $\mathcal{B}$ form linear subspaces of functions, as discussed in
Section 3.1. 
Fig. 3: Known statistical relationships. Each of the single-domain predictors may perform arbitrarily
poorly (in particular, it is possible that $\phi_{\mathcal{A}}(X_1) = 0$ or $\psi_{\mathcal{B}}(X_2) = 0$).



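As an illustrative example, suppose that $X_1$, $X_2$ and $Y$ are scalar RVs, and that $\mathcal{A}$ and $\mathcal{B}$ are the sets
of all linear functions from $\mathbb{R}$ to $\mathbb{R}$. Assume further that we know that the best linear estimator of $Y$
from $X_1$ is $\phi_{\mathcal{A}}(X_1) = 0.1X_1$, the best linear estimator of $Y$ from $X_2$ is $\psi_{\mathcal{B}}(X_2) = 0.2X_2$, the probability
density function (pdf) of $(X_1, X_2)$ is $f_{X_1X_2}(x_1, x_2) \propto \exp\{-(x_1^2 + x_2^2)/2\}$, and that $\mathbb{E}[Y^2] = 1$. Then the
normal density

$$f_{X_1X_2Y}(x_1, x_2, y) \propto \exp\left\{ -\frac{1}{2} \begin{pmatrix} x_1 & x_2 & y \end{pmatrix} \begin{pmatrix} 1 & 0 & 0.1 \\ 0 & 1 & 0.2 \\ 0.1 & 0.2 & 1 \end{pmatrix}^{-1} \begin{pmatrix} x_1 \\ x_2 \\ y \end{pmatrix} \right\} \tag{3.6}$$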



qualifies with all these restrictions and is thus valid. In fact, there is an infinite number (a continuum) of 
other feasible densities. For instance, it can be easily verified that the Gaussian mixture pdf 



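$$f_{X_1X_2Y}(x_1, x_2, y) \propto \frac{1}{\sqrt{|\Sigma_a|}} \exp\left\{ -\frac{1}{2} \begin{pmatrix} x_1 & x_2 & y \end{pmatrix} \Sigma_a^{-1} \begin{pmatrix} x_1 \\ x_2 \\ y \end{pmatrix} \right\} + \frac{1}{\sqrt{|\Sigma_b|}} \exp\left\{ -\frac{1}{2} \begin{pmatrix} x_1 & x_2 & y \end{pmatrix} \Sigma_b^{-1} \begin{pmatrix} x_1 \\ x_2 \\ y \end{pmatrix} \right\},$$

$$\Sigma_a = \begin{pmatrix} 1 & 0 & 0.2 \\ 0 & 1 & 0 \\ 0.2 & 0 & 1 \end{pmatrix}, \qquad \Sigma_b = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0.4 \\ 0 & 0.4 & 1 \end{pmatrix} \tag{3.7}$$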



is also consistent with all the restrictions, making it a valid candidate as well. By contrast, the density 

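$$f_{X_1X_2Y}(x_1, x_2, y) \propto \exp\left\{ -\frac{1}{2} \begin{pmatrix} x_1 & x_2 & y \end{pmatrix} \begin{pmatrix} 2 & 0 & 0.2 \\ 0 & 1 & 0.2 \\ 0.2 & 0.2 & 1 \end{pmatrix}^{-1} \begin{pmatrix} x_1 \\ x_2 \\ y \end{pmatrix} \right\} \tag{3.8}$$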



satisfies all requirements except for the demand that it be consistent with the given marginal distribution
$f_{X_1X_2}(x_1, x_2)$. Therefore, it is not feasible.
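
The feasibility checks in this example reduce to simple moment arithmetic, which the following sketch spells out. It is a minimal illustration under stated assumptions: the covariance matrices are the ones assumed here for the three candidate densities (they are chosen to be consistent with the constraints stated above), the mixture is taken to have equal weights, and the helper names `mixture_moments` and `check_constraints` are hypothetical. The marginal test is a sufficient condition only: it checks that every zero-mean Gaussian component has an identity $(X_1, X_2)$ block.

```python
def mixture_moments(weights, covs):
    """Second-order moment matrix of a zero-mean Gaussian mixture,
    which is the weighted sum of the component covariances."""
    dim = len(covs[0])
    return [[sum(w * c[i][j] for w, c in zip(weights, covs)) for j in range(dim)]
            for i in range(dim)]


def check_constraints(weights, covs, tol=1e-12):
    """Check the four requirements of the scalar example; variable order is (X1, X2, Y)."""
    g = mixture_moments(weights, covs)
    return {
        # Best linear estimator of Y from X1 has slope Gamma_YX1 / Gamma_X1X1 = 0.1.
        "slope_x1": abs(g[2][0] / g[0][0] - 0.1) < tol,
        # Best linear estimator of Y from X2 has slope Gamma_YX2 / Gamma_X2X2 = 0.2.
        "slope_x2": abs(g[2][1] / g[1][1] - 0.2) < tol,
        # Second moment of Y equals 1.
        "energy_y": abs(g[2][2] - 1.0) < tol,
        # Sufficient check: each component's (X1, X2) block is the identity.
        "marginal": all(abs(c[0][0] - 1.0) < tol and abs(c[1][1] - 1.0) < tol
                        and abs(c[0][1]) < tol for c in covs),
    }


# Assumed covariances for the single Gaussian, the two mixture components,
# and the infeasible candidate.
single = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.2], [0.1, 0.2, 1.0]]
mix_a = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.0], [0.2, 0.0, 1.0]]
mix_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.4], [0.0, 0.4, 1.0]]
bad = [[2.0, 0.0, 0.2], [0.0, 1.0, 0.2], [0.2, 0.2, 1.0]]

res_single = check_constraints([1.0], [single])
res_mix = check_constraints([0.5, 0.5], [mix_a, mix_b])
res_bad = check_constraints([1.0], [bad])
print(res_single, res_mix, res_bad)
```

Under these assumptions the first two candidates pass every check, while the third passes the regression and energy checks but fails the marginal one, since its $X_1$ variance is 2 rather than 1.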

3.3 Goals 

The first problem we address in this paper is multi-domain regression. In this context, we would like to
construct a predictor of $Y$ from the two domains $X_1$ and $X_2$, where the only knowledge we have is that
$F_{X_1X_2Y} \in \mathcal{F}$. The second problem we tackle is single-domain regression. Here, the goal is to construct
an estimator of $Y$ given $X_1$ alone based, again, only on the knowledge that $F_{X_1X_2Y} \in \mathcal{F}$. The special case
of shared-representation learning, in which no labeled examples from the first domain are available,
corresponds to setting $\mathcal{A} = \{0\}$. The setting of cross-modality learning, in which there is no access to
training examples from the second domain, can be addressed by setting $\mathcal{B} = \{0\}$. The general case we
treat here can account for a wide spectrum of possibilities, including these two extremes.

Any predictor of $Y$, whether a function of $X_1$ and $X_2$ or of $X_1$ alone, may perform well under certain
distributions $F_{X_1X_2Y} \in \mathcal{F}$ and worse under others. Our goal is therefore to uniformly minimize the MSE
over $\mathcal{F}$. As we will see, this minimax approach leads to simple closed form solutions, which can be
easily applied to the various settings discussed in Section 2.

4. Multi-Domain Regression 

Assume that the joint distribution of the triplet $(X_1, X_2, Y)$ is known to belong to the family $\mathcal{F}$ of (3.5),
where $\mathcal{A}$ and $\mathcal{B}$ are linear subspaces of prediction functions. For any distribution $F_{X_1X_2Y}$, the MSE
attained by an estimator $\hat{Y} = \rho(X_1, X_2)$ is defined as

$$\mathrm{MSE}(F_{X_1X_2Y}, \rho) = \mathbb{E}\left[\|Y - \rho(X_1, X_2)\|^2\right], \tag{4.1}$$

where the expectation is with respect to $F_{X_1X_2Y}$. Since the MSE depends on $F_{X_1X_2Y}$, which is unknown,
our approach is to seek the estimator whose worst-case MSE over $\mathcal{F}$ is minimal. This minimax
concept is widely practiced in deterministic parameter estimation [5, 6] as well as in random parameter
estimation [7, 8]. More concretely, we are interested in¹

$$\rho_M = \arg\min_{\rho} \sup_{F_{X_1X_2Y} \in \mathcal{F}} \mathrm{MSE}(F_{X_1X_2Y}, \rho). \tag{4.2}$$

The next theorem, whose proof can be found in Appendix A, provides a means for solving this problem. 

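Theorem 4.1 (Multi-domain minimax-MSE prediction) Choose any distribution $F_{X_1X_2Y} \in \mathcal{F}$ and
consider the estimator

$$\rho_{\mathcal{C}} = \arg\min_{\rho \in \mathcal{C}} \mathrm{MSE}(F_{X_1X_2Y}, \rho), \tag{4.3}$$

where $\mathcal{C} = \mathcal{A} + \mathcal{B}$, namely

$$\mathcal{C} = \left\{\rho : \rho(x_1, x_2) = \phi(x_1) + \psi(x_2),\ \phi \in \mathcal{A},\ \psi \in \mathcal{B}\right\}. \tag{4.4}$$

Then

1. the function $\rho_{\mathcal{C}}$ does not depend on the choice of $F_{X_1X_2Y} \in \mathcal{F}$;

2. the value $\mathrm{MSE}(F_{X_1X_2Y}, \rho_{\mathcal{C}})$ does not depend on the choice of $F_{X_1X_2Y} \in \mathcal{F}$;

3. the estimator $\rho_{\mathcal{C}}$ of (4.3) is also the solution $\rho_M$ to (4.2).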

Theorem 4.1 shows that instead of solving the minimax problem (4.2), we can equivalently solve the
minimization problem (4.3). Namely, all we need to do is determine the MMSE estimator of $Y$ among

¹ The subscript 'M' stands for 'multi-domain.'






SEMI-SUPERVISED MULTI-DOMAIN REGRESSION 9 of 24 

all functions of the form $\phi(X_1) + \psi(X_2)$ with $\phi \in \mathcal{A}$ and $\psi \in \mathcal{B}$. The importance of this observation
follows from the fact that, as we show below, for many practical cases, the latter possesses a simple
closed form solution.

Before demonstrating the utility of the minimax MSE approach, we note that optimizing the worst- 
case performance of an estimator is very conservative and may sometimes lead to over-pessimistic 
solutions. As an alternative, researchers in many application areas have proposed minimizing the
worst-case regret [6, 7, 14, 15]. The regret of an estimator $\rho(X_1, X_2)$ is defined as the difference between the
MSE it achieves and the MSE of the MMSE solution, namely 

$$\mathrm{REG}(F_{X_1X_2Y}, \rho) = \mathbb{E}\left[\|Y - \rho(X_1, X_2)\|^2\right] - \mathbb{E}\left[\|Y - \mathbb{E}[Y \mid X_1, X_2]\|^2\right]. \tag{4.5}$$

In this expression, both terms depend on $F_{X_1X_2Y}$, so that minimization of the worst-case regret is
generally not equivalent to minimization of the worst-case MSE. Additional insight into the regret can
be obtained from its equivalent characterization [15] as the MSE between $\rho(X_1, X_2)$ and $\mathbb{E}[Y \mid X_1, X_2]$,
namely

$$\mathrm{REG}(F_{X_1X_2Y}, \rho) = \mathbb{E}\left[\|\rho(X_1, X_2) - \mathbb{E}[Y \mid X_1, X_2]\|^2\right]. \tag{4.6}$$
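
For completeness, the equivalence between the two expressions for the regret can be seen from a standard orthogonality argument, sketched below. The shorthands $m = \mathbb{E}[Y \mid X_1, X_2]$ and $\rho = \rho(X_1, X_2)$ are introduced here only for brevity and are not notation used elsewhere in the paper.

```latex
\begin{align*}
\mathbb{E}\left[\|Y - \rho\|^2\right]
  &= \mathbb{E}\left[\|(Y - m) + (m - \rho)\|^2\right] \\
  &= \mathbb{E}\left[\|Y - m\|^2\right] + \mathbb{E}\left[\|m - \rho\|^2\right]
     + 2\,\mathbb{E}\left[(Y - m)^T (m - \rho)\right] \\
  &= \mathbb{E}\left[\|Y - m\|^2\right] + \mathbb{E}\left[\|m - \rho\|^2\right].
\end{align*}
% The cross term vanishes because m - rho is a function of (X_1, X_2) and
% E[Y - m | X_1, X_2] = 0, so conditioning on (X_1, X_2) gives zero.
```

Subtracting $\mathbb{E}[\|Y - m\|^2]$ from both sides leaves exactly the MSE between the estimator and the conditional mean.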

As we show in the following theorem, however, in the multi-domain prediction setting, the minimax- 
regret estimator coincides with the minimax-MSE solution. The proof of the theorem is provided in 
Appendix B. 

Theorem 4.2 (Multi-domain minimax-regret prediction) Consider the problem

$$\rho_R = \arg\min_{\rho} \sup_{F_{X_1X_2Y} \in \mathcal{F}} \mathrm{REG}(F_{X_1X_2Y}, \rho), \tag{4.7}$$

where minimization is performed over all functions $\rho$ of $X_1$ and $X_2$. Then its solution $\rho_R$ coincides with
$\rho_M$ of (4.2).

We now apply Theorem 4.1 in several scenarios. 

4.1 Single-Domain Training

Consider the situation of Figs. 1(a) and 1(b), where we have at our disposal only labeled examples from
one domain, say $X_1$. In this case $\mathcal{B} = \{0\}$ so that $\mathcal{C} = \mathcal{A}$. Consequently, the solution to (4.3) is simply

$$\rho_{\mathcal{C}}(X_1, X_2) = \phi_{\mathcal{A}}(X_1). \tag{4.8}$$

This shows that when labeling unseen examples, there is no gain in basing the prediction on the
domain $X_2$ for which we have no labeled training examples. Furthermore, at least from a worst-case
perspective, there is no better strategy than using our initial predictor based on $X_1$ alone. More
concretely, for any estimator that differs from $\phi_{\mathcal{A}}(X_1)$ (and in particular one that is a function of $X_2$),
there exist distributions $F_{X_1X_2Y} \in \mathcal{F}$ (one of which may be the true underlying distribution) under which the
predictor $\phi_{\mathcal{A}}(X_1)$ performs better.

This result does not stand in contrast to the basic observation in multi-view learning that unlabeled 
data helps [2]. This is because in our setting, we do not assume that the two views are "coherent" or 
tend to agree in any sense, as done, for instance, in [10] in the context of multi-view regression. 
4.2 Multi-Domain Linear Regression

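Suppose, as in Fig. 1(c), that we have a limited amount of labeled examples from both domains, which
only suffice for identifying (with very high precision) the optimal linear predictor from each view. In
this case $\mathcal{A}$ and $\mathcal{B}$ correspond to the collection of all linear functions from $\mathbb{R}^{M_1}$ to $\mathbb{R}^{N}$ and from $\mathbb{R}^{M_2}$ to
$\mathbb{R}^{N}$, respectively. Consequently, $\mathcal{C}$ is the set of all linear functions from $\mathbb{R}^{M_1} \times \mathbb{R}^{M_2}$ to $\mathbb{R}^{N}$. This implies
that the solution to (4.3) is simply the best linear predictor of $Y$ based on $X_1$ and $X_2$, namely

$$\rho_{\mathcal{C}}(x_1, x_2) = \begin{pmatrix} \Gamma_{YX_1} & \Gamma_{YX_2} \end{pmatrix} \begin{pmatrix} \Gamma_{X_1X_1} & \Gamma_{X_1X_2} \\ \Gamma_{X_2X_1} & \Gamma_{X_2X_2} \end{pmatrix}^{\dagger} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}. \tag{4.9}$$

The second-order moments $\Gamma_{X_iX_j}$, $i, j \in \{1, 2\}$, can be estimated from the unlabeled training set.
Similarly, the matrices $\Gamma_{YX_i}$, $i \in \{1, 2\}$, can be determined from the labeled sets.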

The dependence of the multi-domain predictor $\rho_{\mathcal{C}}$ on the single-domain estimators $\phi_{\mathcal{A}}$ and $\psi_{\mathcal{B}}$ is not
apparent at first sight. However, recall that the orthogonality principle states that $\mathbb{E}[(Y - \phi_{\mathcal{A}}(X_1))X_1^T] = 0$
and $\mathbb{E}[(Y - \psi_{\mathcal{B}}(X_2))X_2^T] = 0$. Therefore, the terms $\Gamma_{YX_1}$ and $\Gamma_{YX_2}$ in (4.9) can be replaced by
$\mathbb{E}[\phi_{\mathcal{A}}(X_1)X_1^T]$ and $\mathbb{E}[\psi_{\mathcal{B}}(X_2)X_2^T]$, respectively. As these expectations are with respect to $F_{X_1}$ and $F_{X_2}$,
their computation can be carried out based only on the knowledge of $F_{X_1X_2}$, $\phi_{\mathcal{A}}$ and $\psi_{\mathcal{B}}$, which is
available according to our problem formulation.
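
The following minimal sketch illustrates how a predictor of the form (4.9) might be assembled from three unpaired training sets in the scalar case. Everything about the data is an assumption made for the example: the correlation `R`, the generating model $Y = X_1 + 0.5X_2 + \text{noise}$, and the sample sizes. The cross moments with $Y$ come from the two labeled sets, the moments of $(X_1, X_2)$ come from the unlabeled pairs, and a plain 2x2 inverse stands in for the pseudo-inverse.

```python
import math
import random

rng = random.Random(1)
R = 0.6      # assumed correlation between X1 and X2 (illustrative)
N = 40000    # assumed size of each of the three training sets


def draw():
    """One draw of (x1, x2, y) from the assumed synthetic model."""
    x1 = rng.gauss(0.0, 1.0)
    x2 = R * x1 + math.sqrt(1.0 - R * R) * rng.gauss(0.0, 1.0)
    y = 1.0 * x1 + 0.5 * x2 + rng.gauss(0.0, 0.3)
    return x1, x2, y


# Three independent, unpaired sets: (x1, y), (x2, y) and (x1, x2).
lab1, lab2, unl = [], [], []
for _ in range(N):
    x1, _x2, y = draw()
    lab1.append((x1, y))
for _ in range(N):
    _x1, x2, y = draw()
    lab2.append((x2, y))
for _ in range(N):
    x1, x2, _y = draw()
    unl.append((x1, x2))

# Second-order moments of (X1, X2) from the unlabeled pairs.
g11 = sum(a * a for a, _ in unl) / N
g12 = sum(a * b for a, b in unl) / N
g22 = sum(b * b for _, b in unl) / N
# Cross moments with Y, each from its own labeled set.
gy1 = sum(x * y for x, y in lab1) / N
gy2 = sum(x * y for x, y in lab2) / N

# Solve the 2x2 system of (4.9) by Cramer's rule (assumes nonsingularity).
det = g11 * g22 - g12 * g12
w1 = (gy1 * g22 - g12 * gy2) / det
w2 = (g11 * gy2 - g12 * gy1) / det
print(w1, w2)  # should land near the generating weights 1.0 and 0.5
```

No sample in this sketch ever contains $x_1$, $x_2$ and $y$ together, yet the joint linear weights are recovered, because in the linear case the required moments are exactly those supplied by the three sets.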

4.3 Multi-Domain Parametric Regression 

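The above observation naturally extends to the case in which the training sets suffice for identifying the
optimal parametric predictors of the forms

$$\phi(X_1) = \sum_{k=1}^{K_1} a_k^1 \phi_k(X_1), \qquad \psi(X_2) = \sum_{k=1}^{K_2} a_k^2 \psi_k(X_2), \tag{4.10}$$

where $\{\phi_k\}_{k=1}^{K_1}$ and $\{\psi_k\}_{k=1}^{K_2}$ are given functions and $\{a_k^1\}_{k=1}^{K_1}$ and $\{a_k^2\}_{k=1}^{K_2}$ are arbitrary parameters. In
this situation, $\mathcal{C}$ corresponds to the family of functions having the form

$$\rho(X_1, X_2) = \sum_{k=1}^{K_1} a_k^1 \phi_k(X_1) + \sum_{k=1}^{K_2} a_k^2 \psi_k(X_2). \tag{4.11}$$

Thus, the optimal set of parameters $a = (a_1^1 \cdots a_{K_1}^1 \ a_1^2 \cdots a_{K_2}^2)^T$ is given by

$$a = \begin{pmatrix} \Gamma_{\Phi\Phi} & \Gamma_{\Phi\Psi} \\ \Gamma_{\Psi\Phi} & \Gamma_{\Psi\Psi} \end{pmatrix}^{\dagger} \begin{pmatrix} \Gamma_{\Phi Y} \\ \Gamma_{\Psi Y} \end{pmatrix}, \tag{4.12}$$

with $\Gamma_{\Phi\Phi}$, $\Gamma_{\Psi\Psi}$, $\Gamma_{\Phi Y}$ and $\Gamma_{\Psi Y}$ being as in (3.4) and $\Gamma_{\Phi\Psi}$ being a $K_1 \times K_2$ matrix whose $(i,j)$-th entry
is $\mathbb{E}[\phi_i(X_1)^T \psi_j(X_2)]$. Similar to linear regression, the vectors $\Gamma_{\Phi Y}$ and $\Gamma_{\Psi Y}$ can be replaced, due to the
orthogonality principle, by vectors whose $j$-th entries are $\mathbb{E}[\phi_j^T(X_1)\phi_{\mathcal{A}}(X_1)]$ and $\mathbb{E}[\psi_j^T(X_2)\psi_{\mathcal{B}}(X_2)]$,
respectively.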

4.4 Multi-Domain Partially Linear Regression 

Suppose, as in Fig. 1(d), that we have numerous labeled examples from the first domain, allowing us to
determine $\mathbb{E}[Y \mid X_1]$, and only a limited amount of examples from the second domain, so that we can only
determine the best linear predictor of $Y$ from $X_2$. In this setting, Theorem 4.1 implies that the
minimax-optimal predictor based on $X_1$ and $X_2$ is the estimator minimizing the MSE among all functions of the
form

$$\rho(X_1, X_2) = a(X_1) + BX_2, \tag{4.13}$$

where $a : \mathbb{R}^{M_1} \to \mathbb{R}^{N}$ is an arbitrary function and $B \in \mathbb{R}^{N \times M_2}$ is some matrix. It was shown in [16] that
the solution to this particular case is given by

$$\rho_M(X_1, X_2) = \mathbb{E}[Y \mid X_1] + \Gamma_{YW} \Gamma_{WW}^{\dagger} W, \tag{4.14}$$

where $W = X_2 - \mathbb{E}[X_2 \mid X_1]$.

The intuition here is that we need to make sure we do not account for variations in $Y$ twice when
fusing information from $X_1$ and $X_2$. Thus, we start with the estimate $\phi_{\mathcal{A}}(X_1) = \mathbb{E}[Y \mid X_1]$, and then update
it with the LMMSE estimate of $Y$ based on the innovation $X_2 - \mathbb{E}[X_2 \mid X_1]$ of $X_2$ with respect to $\phi_{\mathcal{A}}(X_1)$.

In practice, the term $\mathbb{E}[Y \mid X_1]$ can be approximated from the labeled training examples of the first
domain, e.g., using nonparametric methods. The second term in (4.14) can be obtained via a three-stage
procedure. Specifically, we first employ a nonparametric technique to approximate $\xi(x_1) = \mathbb{E}[X_2 \mid X_1 =
x_1]$ from the unlabeled set. Next, we use the unlabeled samples to form the set $\{x_2^u - \xi(x_1^u)\}_{u=L_1+L_2+1}^{L_1+L_2+U}$,
from which we approximate the covariance matrix $\Gamma_{WW}$ of $W = X_2 - \mathbb{E}[X_2 \mid X_1]$. Lastly, we approximate
$\Gamma_{YX_2}$ from the labeled examples $\{x_2^\ell, y^\ell\}_{\ell=L_1+1}^{L_1+L_2}$ and $\Gamma_{Y\xi(X_1)}$ from the labeled examples $\{\xi(x_1^\ell), y^\ell\}_{\ell=1}^{L_1}$
in order to compute $\Gamma_{YW} = \Gamma_{YX_2} - \Gamma_{Y\xi(X_1)}$.
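
The three-stage procedure can be sketched as follows for scalar variables. This is an illustration under several assumptions that are not part of the paper: the synthetic model ($X_1$ uniform on $[-1, 1]$, $X_2 = X_1^2 + \text{noise}$, $Y = X_1 + 2X_2 + \text{noise}$), the sample size, and the use of a crude piecewise-constant (binned) average as the nonparametric regressor. Any other nonparametric method could be substituted for `binned_mean`.

```python
import random

rng = random.Random(2)
BINS = 20    # assumed number of equal-width bins on [-1, 1]
N = 20000    # assumed size of each of the three training sets


def draw():
    """One draw of (x1, x2, y) from the assumed synthetic model."""
    x1 = rng.uniform(-1.0, 1.0)
    x2 = x1 * x1 + rng.gauss(0.0, 1.0)
    y = x1 + 2.0 * x2 + rng.gauss(0.0, 0.3)
    return x1, x2, y


def bin_of(x1):
    """Index of the bin containing x1."""
    return min(int((x1 + 1.0) / 2.0 * BINS), BINS - 1)


def binned_mean(pairs):
    """Piecewise-constant estimate of E[V | X1] from (x1, v) pairs."""
    sums, counts = [0.0] * BINS, [0] * BINS
    for x1, v in pairs:
        b = bin_of(x1)
        sums[b] += v
        counts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


lab1, lab2, unl = [], [], []
for _ in range(N):
    x1, _x2, y = draw()
    lab1.append((x1, y))      # labeled set from domain 1
for _ in range(N):
    _x1, x2, y = draw()
    lab2.append((x2, y))      # labeled set from domain 2
for _ in range(N):
    x1, x2, _y = draw()
    unl.append((x1, x2))      # unlabeled pairs

# Stage 1: xi(x1) approximates E[X2 | X1 = x1], from the unlabeled pairs.
xi = binned_mean(unl)
# Stage 2: second moment of the innovation W = X2 - xi(X1), also unlabeled.
g_ww = sum((x2 - xi[bin_of(x1)]) ** 2 for x1, x2 in unl) / N
# Stage 3: Gamma_YW = Gamma_YX2 - Gamma_Yxi(X1), one term per labeled set.
g_yx2 = sum(x2 * y for x2, y in lab2) / N
g_yxi = sum(xi[bin_of(x1)] * y for x1, y in lab1) / N
gain = (g_yx2 - g_yxi) / g_ww

# First term of (4.14): E[Y | X1], from the first labeled set.
cond_mean_y = binned_mean(lab1)


def predict(x1, x2):
    """Estimate of the form (4.14) in the scalar case."""
    b = bin_of(x1)
    return cond_mean_y[b] + gain * (x2 - xi[b])


print(gain, predict(0.55, 1.0))
```

In this synthetic model the innovation is the additive noise in $X_2$, so the gain should come out close to 2 and `predict(0.55, 1.0)` close to $0.55 + 2 \cdot 1.0 = 2.55$, up to binning and sampling error.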

4.5 Multi-Domain Semi-Parametric Regression 

Suppose, as above, that we know $\mathbb{E}[Y \mid X_1]$, but that we can also determine the best estimator of $Y$ from
$X_2$ among the parametric family

$$\psi(X_2) = \sum_{k=1}^{K} a_k \psi_k(X_2). \tag{4.15}$$

In this case, according to Theorem 4.1, the minimax-optimal estimator of $Y$ based on $X_1$ and $X_2$ is the
one minimizing the MSE among all functions of the form

$$\rho(X_1, X_2) = a(X_1) + \sum_{k=1}^{K} a_k \psi_k(X_2). \tag{4.16}$$

The solution to this problem can be deduced by relying on the concept of $(\mathcal{A}, \mathcal{B})$-innovation, as we
now define.

Definition 4.3 The $(\mathcal{A}, \mathcal{B})$-innovation of $X_2$ with respect to $X_1$, which we denote by $\rho_{\mathcal{A},\mathcal{B}}(X_1, X_2)$,
is the MMSE estimator of $Y$ among all functions of the form

$$\psi(X_2) - \eta_{\psi}(X_1), \tag{4.17}$$

with $\psi$ being some function in $\mathcal{B}$ and $\eta_{\psi}(X_1)$ denoting the $\mathcal{A}$-optimal estimator of $\psi(X_2)$ from $X_1$.

Using this definition, we make the following observation regarding the structure of the minimax 
estimator, the proof of which is given in Appendix C. 
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Theorem 4.4 The solution to problem (4.3) can be expressed as 

p^giXuX2) = (pAXi)+P^M^i^^2), (4.18) 

where Pj^„-jg{Xi,X2) is the (j2/,,^)-innovationof X2 with respect to Xi. 

In our setting, $\mathcal{A}$ corresponds to the set of all functions mapping $X_1$ to the space of $Y$, so that $\varphi_{\mathcal{A}}(X_1) = \mathbb{E}[Y|X_1]$. Furthermore, $\mathcal{B}$ is the family of functions of $X_2$ having the form (4.15). Therefore, for any $\psi \in \mathcal{B}$, the $\mathcal{A}$-optimal estimator of $\psi(X_2)$ based on $X_1$ is given by

$$\eta_\psi(X_1) = \mathbb{E}[\psi(X_2)|X_1] = \mathbb{E}\left[\sum_{k=1}^{K} a_k \psi_k(X_2) \,\middle|\, X_1\right] = \sum_{k=1}^{K} a_k \mathbb{E}[\psi_k(X_2)|X_1]. \tag{4.19}$$

Consequently, $\rho_{\mathcal{A},\mathcal{B}}(X_1,X_2)$ in (4.18) is of the form

$$\psi(X_2) - \eta_\psi(X_1) = \sum_{k=1}^{K} a_k \psi_k(X_2) - \sum_{k=1}^{K} a_k \mathbb{E}[\psi_k(X_2)|X_1] = \sum_{k=1}^{K} a_k \rho_k(X_1,X_2), \tag{4.20}$$

where we denoted $\rho_k(X_1,X_2) = \psi_k(X_2) - \mathbb{E}[\psi_k(X_2)|X_1]$. The optimal set of coefficients is given by

$$a^* = \Gamma_{\rho\rho}^{\dagger}\Gamma_{\rho Y}, \tag{4.21}$$

where $\Gamma_{\rho\rho}$ and $\Gamma_{\rho Y}$ are as in (3.4) with $\varphi_i(X_1)$ replaced by $\rho_i(X_1,X_2)$.
To conclude, the optimal estimator of the form (4.16) is

$$\rho_M(X_1,X_2) = \mathbb{E}[Y|X_1] + \sum_{k=1}^{K} a_k \left(\psi_k(X_2) - \mathbb{E}[\psi_k(X_2)|X_1]\right), \tag{4.22}$$

with coefficients $\{a_k\}$ given by (4.21). The first term in this expression can be approximated via nonparametric regression techniques from the labeled training examples of the first domain. The second term can be computed in two stages. First, each of the functions $\{\psi_k(X_2)\}_{k=1}^{K}$ is regressed on $X_1$ using the unlabeled data set, to obtain an approximation of $\mathbb{E}[\psi_k(X_2)|X_1]$. Then, $Y$ is linearly regressed against $\{\psi_k(X_2) - \mathbb{E}[\psi_k(X_2)|X_1]\}_{k=1}^{K}$, using the two labeled sets, as discussed in Section 4.4.
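The two-stage procedure just described can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not part of the text: $X_1$, $X_2$ and $Y$ are scalars, the basis $\psi_1(x)=x$, $\psi_2(x)=x^2$ is hypothetical, conditional expectations are approximated with a fixed-bandwidth Nadaraya-Watson smoother, and the cross-moment between the innovations and $Y$ is assembled from the two unpaired labeled sets as $\mathbb{E}[\psi_k(X_2)Y] - \mathbb{E}[\mathbb{E}[\psi_k(X_2)|X_1]\,Y]$. The function names `basis`, `smooth` and `fit` are ours.

```python
import numpy as np

def basis(x2):
    """Hypothetical parametric family: psi_1(x) = x, psi_2(x) = x**2."""
    return np.column_stack([x2, x2 ** 2])

def smooth(x_train, t_train, x_query, h=0.3):
    """Nadaraya-Watson estimate of E[t | x] (scalar x, Gaussian kernel)."""
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / h) ** 2)
    return (w @ t_train) / w.sum(axis=1, keepdims=True)

def fit(x1_lab, y1_lab, x2_lab, y2_lab, x1_unl, x2_unl, h=0.3):
    psi_unl = basis(x2_unl)
    # Stage 1: regress each psi_k(X2) on X1 over the unlabeled pairs.
    g = lambda x1: smooth(x1_unl, psi_unl, x1, h)
    # Innovations rho_k = psi_k(X2) - E[psi_k(X2) | X1] on the unlabeled pairs.
    rho_unl = psi_unl - g(x1_unl)
    gamma_rr = rho_unl.T @ rho_unl / len(x1_unl)
    # Stage 2: cross-moment of the innovations with Y, built from the
    # two unpaired labeled sets (assumed reading of "Section 4.4").
    gamma_ry = (basis(x2_lab).T @ y2_lab / len(y2_lab)
                - g(x1_lab).T @ y1_lab / len(y1_lab))
    a = np.linalg.pinv(gamma_rr) @ gamma_ry
    # First term of (4.22): nonparametric estimate of E[Y | X1].
    f1 = lambda x1: smooth(x1_lab, y1_lab[:, None], x1, h)[:, 0]
    predict = lambda x1, x2: f1(x1) + (basis(x2) - g(x1)) @ a
    return predict, a

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
def sample(n):
    x2 = rng.normal(size=n)
    x1 = x2 + 0.5 * rng.normal(size=n)
    y = x2 ** 2 + x1 + 0.1 * rng.normal(size=n)
    return x1, x2, y

x1a, _, ya = sample(300)     # labeled set of the first domain
_, x2b, yb = sample(300)     # labeled set of the second domain
x1u, x2u, _ = sample(1000)   # unlabeled paired set
predict, a = fit(x1a, ya, x2b, yb, x1u, x2u)
y_hat = predict(np.array([0.0, 1.0]), np.array([0.5, 1.5]))
```

The pseudo-inverse is used so that the sketch still returns coefficients when the innovations are (nearly) linearly dependent.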

5. Single-Domain Regression with Multi-Domain Training 

Next, we address the setting in which, at the testing stage, our predictor is supplied with only one type of features, say $X_1$. The interesting question in this context is how to take into account the training sets of both domains in order to design an improved estimator of $Y$ based on $X_1$ alone.

Since our estimator operates on $X_1$ and is judged by the proximity of its output to $Y$, its performance is only affected by the joint distribution of $Y$ and $X_1$. It may thus seem at first that the second set of features $X_2$ cannot help in improving estimation accuracy. However, note that $F_{X_1Y}$ is not fully known in our setting. Thus, being told the statistical relations between $Y$ and $X_2$ and between $X_1$ and $X_2$ might help to narrow down the set of candidate distributions $F_{X_1Y}$ for which we need to design an estimator.

The statistical relations known to us are the same as in Section 4. Namely, we know that $F_{X_1X_2Y}$ belongs to the class $\mathcal{F}$ of (3.5). Therefore, as in Section 4, our goal is to optimize the worst-case performance of our estimator over $\mathcal{F}$. As it turns out, in contrast with the multi-domain problem, in the single-domain setting the minimax-MSE and minimax-regret solutions no longer coincide. Here, we focus on minimizing the worst-case regret. As will be clear from the proof provided in Appendix B, determining the minimax-MSE estimator in the single-domain setting is much harder than minimizing the worst-case regret. The former remains an open problem.

In single-domain regression, whatever we do, our estimator will not achieve lower MSE than the conditional expectation $\mathbb{E}[Y|X_1]$. Therefore, the regret of interest is now

$$\mathrm{REG}(F_{X_1X_2Y},\rho) = \mathbb{E}\left[\|Y-\rho(X_1)\|^2\right] - \mathbb{E}\left[\|Y-\mathbb{E}[Y|X_1]\|^2\right]. \tag{5.1}$$

As in the multi-domain setting, this regret can be written as [15]

$$\mathrm{REG}(F_{X_1X_2Y},\rho) = \mathbb{E}\left[\|\rho(X_1)-\mathbb{E}[Y|X_1]\|^2\right]. \tag{5.2}$$

Our goal is to determine the minimax-regret estimator¹

$$\rho_S = \arg\min_{\rho} \sup_{F_{X_1X_2Y}\in\mathcal{F}} \mathrm{REG}(F_{X_1X_2Y},\rho), \tag{5.3}$$

where now minimization is performed only over functions $\rho$ of $X_1$.

The next theorem, whose proof may be found in Appendix B, describes the single-domain minimax-regret estimator in terms of the multi-domain minimax-MSE solution.

Theorem 5.1 (Single-domain minimax-regret prediction) The solution to problem (5.3) is given by

$$\rho_S(X_1) = \mathbb{E}[\rho_M(X_1,X_2)|X_1], \tag{5.4}$$

where $\rho_M(X_1,X_2)$ is the multi-domain minimax estimator (4.2).

This result has a very simple and intuitive explanation. We know that $F_{X_1X_2Y}$ belongs to the set $\mathcal{F}$, and therefore $\rho_M(X_1,X_2)$ is the optimal estimate of $Y$ in a minimax-MSE sense. However, we cannot use this estimate as it is a function of $X_2$, which is not measured in our setting. What Theorem 5.1 shows is that the optimal strategy is to estimate $\rho_M(X_1,X_2)$ based on the available measurements, which are $X_1$ alone. Computation of the conditional expectation $\mathbb{E}[\rho_M(X_1,X_2)|X_1]$ only requires knowledge of the marginal distribution $F_{X_1X_2}$, which is available in our setting.

We now apply this result to two interesting special cases. 

5.1 Cross Domain Regression 

In cross-modality learning [17], we only have labeled examples from domain $X_1$ and not from $X_2$, as illustrated in figs. 2(a) and 2(b). The basic intuition here, as presented in [17], is that the unlabeled data may be used to boost the performance of the best single-domain estimator $\varphi_{\mathcal{A}}(X_1)$ that can be designed based solely on labeled examples from the domain $X_1$.

This setting can be treated within our framework by setting $\psi_{\mathcal{B}}(X_2) = 0$. As we have seen in Section 4.1, in this situation $\rho_M(X_1,X_2) = \varphi_{\mathcal{A}}(X_1)$. Therefore, the single-domain minimax-regret predictor of $Y$ from $X_1$ is given by

$$\rho_S(X_1) = \mathbb{E}[\varphi_{\mathcal{A}}(X_1)|X_1] = \varphi_{\mathcal{A}}(X_1). \tag{5.5}$$

We see that, despite the fact that we know $F_{X_1X_2}$, there is no better strategy than using the estimator $\varphi_{\mathcal{A}}(X_1)$ here. This implies that cross-modality learning is not useful unless additional knowledge on the underlying distributions is available.

¹ The subscript 'S' stands for 'single-domain.'

The authors of [17] used cross-modality learning to classify isolated words from either audio or video (lipreading). It was reported that unlabeled audio-visual examples helped improve visual recognition but failed to boost the performance of an audio classifier. This empirical result aligns with our theoretical analysis, which states that, in the worst-case scenario, there is nothing better to do than to disregard the modality for which no labeled examples are available.

5.2 Shared Representation Regression 

In shared-representation learning [17], also referred to as estimation with partial knowledge [15], we have no labeled examples from domain $X_1$ but rather only from $X_2$. This is illustrated in figs. 2(c) and 2(d). Since we can learn a predictor $\psi_{\mathcal{B}}(X_2)$ from the second domain, and only measure an instance $X_1$ from the first domain, a naive approach would be to feed the predictor $\psi_{\mathcal{B}}$ with an estimate of $X_2$, which is based on $X_1$, rather than with $X_2$ itself. For example, the MMSE estimate $\mathbb{E}[X_2|X_1]$ can be approximated by nonparametric methods from the unlabeled training set. However, as we now show, this strategy is generally not minimax-optimal.

Recall from Section 4.1 that the multi-domain predictor corresponding to the setting in which $\mathcal{A} = \{0\}$ is $\rho_M(X_1,X_2) = \psi_{\mathcal{B}}(X_2)$. Therefore, the single-domain minimax-regret predictor of $Y$ from $X_1$ is given by

$$\rho_S(X_1) = \mathbb{E}[\psi_{\mathcal{B}}(X_2)|X_1] \tag{5.6}$$

in this case. This solution generalizes the estimator of [15, Thm. 8], which was developed for the case in which $\mathcal{B}$ is the set of all functions. In the latter scenario, $\psi_{\mathcal{B}}(X_2) = \mathbb{E}[Y|X_2]$, and the two methods coincide.

As an example, consider the setting in which we have a limited number of labeled examples from domain $X_2$, which only allows us to determine the best linear predictor of $Y$ from $X_2$. In this case, $\psi_{\mathcal{B}}(X_2) = \Gamma_{YX_2}\Gamma_{X_2X_2}^{\dagger}X_2$, implying that $\rho_S(X_1) = \mathbb{E}[\Gamma_{YX_2}\Gamma_{X_2X_2}^{\dagger}X_2|X_1] = \Gamma_{YX_2}\Gamma_{X_2X_2}^{\dagger}\mathbb{E}[X_2|X_1]$. Namely, minimax-regret estimation does boil down, in this setting, to the naive strategy of applying $\psi_{\mathcal{B}}$ on $\mathbb{E}[X_2|X_1]$. This, however, is not always the case. Suppose, for instance, that we have numerous examples from domain $X_2$, so that $\mathcal{B}$ is the set of all functions mapping $X_2$ to the space of $Y$. In this situation, $\psi_{\mathcal{B}}(X_2) = \mathbb{E}[Y|X_2]$, so that $\rho_S(X_1) = \mathbb{E}[\mathbb{E}[Y|X_2]|X_1]$. This solution does not generally coincide with the naive estimator $\mathbb{E}[Y|\mathbb{E}[X_2|X_1]]$.
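A toy discrete example makes the gap between the two strategies concrete. The distribution and the predictor $\psi(x_2) = x_2^2$ below are hypothetical and chosen only for illustration: given $X_1 = 0$, $X_2$ is $\pm 1$ with equal probability, and given $X_1 = 1$, $X_2 = 2$ with certainty. The minimax-regret rule averages $\psi(X_2)$ over the conditional distribution, whereas the naive rule evaluates $\psi$ at the conditional mean of $X_2$.

```python
import numpy as np

# Support of X2 and conditional probabilities P(X2 = x2 | X1 = x1);
# rows correspond to x1 = 0 and x1 = 1 (hypothetical toy distribution).
x2_vals = np.array([-1.0, 1.0, 2.0])
p_x2_given_x1 = np.array([[0.5, 0.5, 0.0],
                          [0.0, 0.0, 1.0]])

def psi(x2):
    """Hypothetical predictor of Y from X2, standing in for E[Y | X2]."""
    return x2 ** 2

# Minimax-regret rule (5.6): E[psi(X2) | X1].
minimax = p_x2_given_x1 @ psi(x2_vals)   # [1., 4.]

# Naive rule: psi applied to E[X2 | X1].
naive = psi(p_x2_given_x1 @ x2_vals)     # [0., 4.]
```

The two rules agree at $X_1 = 1$, where $X_2$ is known exactly, and differ at $X_1 = 0$, where the nonlinearity of $\psi$ matters.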

The estimator (5.6) can be approximated from the available training data by first determining the function $\psi_{\mathcal{B}}(x_2)$ from the labeled set of the second domain and then using nonparametric regression on the set of pairs $(x_1, \psi_{\mathcal{B}}(x_2))$ formed from the unlabeled examples.
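A minimal sketch of this two-step approximation is given below. It rests on assumptions of ours: $X_2$ and $Y$ are scalars, $\mathcal{B}$ is taken to be the (hypothetical) family of quadratic polynomials in $x_2$ fitted by least squares, and the nonparametric regression is a Nadaraya-Watson smoother with a fixed bandwidth `h`. The names `fit_psi` and `shared_representation_predictor` are illustrative.

```python
import numpy as np

def fit_psi(x2_lab, y_lab):
    """Least-squares fit of psi(x2) = c0 + c1*x2 + c2*x2**2 on the labeled set."""
    feats = lambda x2: np.column_stack([np.ones_like(x2), x2, x2 ** 2])
    coef, *_ = np.linalg.lstsq(feats(x2_lab), y_lab, rcond=None)
    return lambda x2: feats(x2) @ coef

def shared_representation_predictor(psi, x1_unl, x2_unl, h=0.5):
    """Approximate E[psi(X2) | X1] by kernel regression on the unlabeled pairs."""
    targets = psi(x2_unl)
    def rho_s(x1_query):
        d2 = ((x1_query[:, None, :] - x1_unl[None, :, :]) ** 2).sum(axis=-1)
        w = np.exp(-0.5 * d2 / h ** 2)
        return (w @ targets) / w.sum(axis=1)
    return rho_s

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
x2_lab = rng.normal(size=200)
y_lab = 1.0 + 2.0 * x2_lab + 3.0 * x2_lab ** 2           # noiseless labels
x2_unl = rng.normal(size=500)
x1_unl = np.column_stack([x2_unl + 0.3 * rng.normal(size=500),
                          rng.normal(size=500)])          # 2-D feature X1

psi = fit_psi(x2_lab, y_lab)
rho_s = shared_representation_predictor(psi, x1_unl, x2_unl)
pred = rho_s(np.array([[0.0, 0.0], [1.0, 0.0]]))
```

Note that the labeled set is used only to fit `psi`, and the unlabeled pairs are used only to carry `psi` over to the observed domain.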

5.3 Regression with Side Information 

The general setting in which we have training data from both domains can be treated by employing Theorem 4.4. Specifically, when $\mathcal{A}$ and $\mathcal{B}$ are two arbitrary spaces of prediction functions, $\rho_M(X_1,X_2)$ is given by (4.18), and therefore

$$\rho_S(X_1) = \varphi_{\mathcal{A}}(X_1) + \mathbb{E}[\rho_{\mathcal{A},\mathcal{B}}(X_1,X_2)|X_1], \tag{5.7}$$

where $\rho_{\mathcal{A},\mathcal{B}}(X_1,X_2)$ is the $(\mathcal{A},\mathcal{B})$-innovation of $X_2$ with respect to $X_1$. This representation highlights the fact that the second labeled set and the unlabeled set come into play in the term $\mathbb{E}[\rho_{\mathcal{A},\mathcal{B}}(X_1,X_2)|X_1]$.

To understand when training data from an unobserved domain cannot help, we recall from Definition 4.3 that $\rho_{\mathcal{A},\mathcal{B}}(X_1,X_2)$ is of the form $\psi(X_2) - \eta_\psi(X_1)$, with $\psi \in \mathcal{B}$ and $\eta_\psi(X_1)$ being the $\mathcal{A}$-optimal estimate of $\psi(X_2)$ from $X_1$. Therefore, the second term in (5.7) vanishes if, for example,

$$\mathbb{E}[\psi(X_2)|X_1] = \eta_\psi(X_1) \tag{5.8}$$

for every $\psi \in \mathcal{B}$. Intuitively, this can happen if the class $\mathcal{A}$ of functions is very rich and/or the class $\mathcal{B}$ is not. As an example, if $\mathcal{A}$ is the set of all functions mapping $X_1$ to the space of $Y$, then $\eta_\psi(X_1) = \mathbb{E}[\psi(X_2)|X_1]$, so that (5.8) is satisfied, indicating that the training set from the second domain is not needed. Indeed, in this situation $\varphi_{\mathcal{A}}(X_1) = \mathbb{E}[Y|X_1]$, meaning that we can already determine the MMSE predictor of $Y$ from $X_1$ using the first training set so that no potential improvement can be obtained using the second set.

As a more interesting example, suppose that the RVs $X_1$ and $X_2$ are jointly Gaussian, that $\mathcal{B}$ is the set of all linear functions mapping $X_2$ to the space of $Y$, and that $\mathcal{A}$ contains the set of all linear functions mapping $X_1$ to the space of $Y$. In this case, every $\psi \in \mathcal{B}$ corresponds to some matrix $A$ such that $\psi(X_2) = AX_2$. Consequently, using the fact that the MMSE estimate is linear in the Gaussian setting,

$$\mathbb{E}[\psi(X_2)|X_1] = \mathbb{E}[AX_2|X_1] = A\,\mathbb{E}[X_2|X_1] = A\Gamma_{X_2X_1}\Gamma_{X_1X_1}^{\dagger}X_1. \tag{5.9}$$

Moreover, $X_1$ and $\psi(X_2)$ are jointly Gaussian, implying that

$$\eta_\psi(X_1) = \Gamma_{\psi(X_2)X_1}\Gamma_{X_1X_1}^{\dagger}X_1 = A\Gamma_{X_2X_1}\Gamma_{X_1X_1}^{\dagger}X_1. \tag{5.10}$$

Thus, (5.9) and (5.10) coincide and (5.8) is satisfied, indicating that the second training set is not required here as well.

Another interesting viewpoint can be obtained by switching the roles of $X_1$ and $X_2$ in the representation (4.18) of $\rho_M(X_1,X_2)$. This leads to the expression

$$\rho_S(X_1) = \mathbb{E}[\psi_{\mathcal{B}}(X_2)|X_1] + \mathbb{E}[\rho_{\mathcal{B},\mathcal{A}}(X_2,X_1)|X_1]. \tag{5.11}$$

Here, we recognize the first term as being the shared-representation estimator (5.6) of $Y$ from $X_1$, which does not use labeled examples from the domain $X_1$. Therefore, we see that the training set from the first (observed) domain is not needed if the second term in (5.11) vanishes. Using the fact that $\rho_{\mathcal{B},\mathcal{A}}(X_2,X_1) = \varphi(X_1) - \eta_\varphi(X_2)$ with $\varphi \in \mathcal{A}$ and $\eta_\varphi(X_2)$ being the $\mathcal{B}$-optimal estimate of $\varphi(X_1)$ from $X_2$, we conclude that this happens if, for example,

$$\varphi(X_1) = \mathbb{E}[\eta_\varphi(X_2)|X_1] \tag{5.12}$$

for every $\varphi \in \mathcal{A}$. As a concrete example, consider again the setting in which the RVs $X_1$ and $X_2$ are jointly Gaussian and $\mathcal{A}$ and $\mathcal{B}$ are classes of linear functions. In this situation, $\varphi(X_1) = AX_1$ for some matrix $A$, so that $\eta_\varphi(X_2) = \Gamma_{\varphi(X_1)X_2}\Gamma_{X_2X_2}^{\dagger}X_2 = A\Gamma_{X_1X_2}\Gamma_{X_2X_2}^{\dagger}X_2$ and, consequently,

$$\mathbb{E}[\eta_\varphi(X_2)|X_1] = A\Gamma_{X_1X_2}\Gamma_{X_2X_2}^{\dagger}\mathbb{E}[X_2|X_1] = A\Gamma_{X_1X_2}\Gamma_{X_2X_2}^{\dagger}\Gamma_{X_2X_1}\Gamma_{X_1X_1}^{\dagger}X_1. \tag{5.13}$$

Therefore, (5.12) is satisfied if $\Gamma_{X_1X_2}\Gamma_{X_2X_2}^{\dagger}\Gamma_{X_2X_1}\Gamma_{X_1X_1}^{\dagger} = I$, or, equivalently, if $\Gamma_{X_1X_1} - \Gamma_{X_1X_2}\Gamma_{X_2X_2}^{\dagger}\Gamma_{X_2X_1} = 0$. The latter expression is none other than the error covariance of the MMSE estimate of $X_1$ from $X_2$. Therefore, condition (5.12) is satisfied in this setting if $X_1$ can be estimated from $X_2$ with no error. Indeed, in this scenario, we do not need to observe training examples from the domain $X_1$, as these can be synthetically generated from the examples of the second domain.
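The error-covariance condition can be checked numerically on a small example. The sketch below assumes zero-mean variables, a hypothetical covariance `gamma_22` for $X_2$, and a hypothetical linear relation $X_1 = BX_2 + N$ with white noise $N$ independent of $X_2$. With no noise, $X_1$ is a deterministic function of $X_2$ and the error covariance vanishes. With noise, it equals the noise covariance.

```python
import numpy as np

# Hypothetical covariance of X2 (symmetric, diagonally dominant, hence positive definite).
gamma_22 = np.array([[2.0, 0.5, 0.0],
                     [0.5, 1.0, 0.3],
                     [0.0, 0.3, 1.5]])
# Hypothetical linear map: X1 = B @ X2 + noise.
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])

def error_cov(noise_var):
    """Error covariance of the linear MMSE estimate of X1 from X2."""
    gamma_12 = B @ gamma_22
    gamma_11 = B @ gamma_22 @ B.T + noise_var * np.eye(2)
    return gamma_11 - gamma_12 @ np.linalg.pinv(gamma_22) @ gamma_12.T

err_exact = error_cov(0.0)   # X1 perfectly predictable from X2
err_noisy = error_cov(0.1)   # residual equals the noise covariance
```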

To approximate the resulting estimators from sets of points, it is often more convenient to use the form (5.11) rather than (5.7). As a concrete example, consider linear regression with nonlinear side information, namely where $\mathcal{A}$ is the set of all linear functions and $\mathcal{B}$ is the family of all (not necessarily linear) functions. Then, from Theorem 5.1 and (4.14) we conclude that

$$\rho_S(X_1) = \mathbb{E}[\mathbb{E}[Y|X_2]|X_1] + \Gamma_{YW}\Gamma_{WW}^{\dagger}\left(X_1 - \mathbb{E}[\mathbb{E}[X_1|X_2]|X_1]\right), \tag{5.14}$$

where here $W = X_1 - \mathbb{E}[X_1|X_2]$. The terms $\mathbb{E}[\mathbb{E}[Y|X_2]|X_1]$ and $\mathbb{E}[\mathbb{E}[X_1|X_2]|X_1]$ can be approximated using nonparametric methods, similar to the discussion in Section 5.2, and the covariance matrices $\Gamma_{YW}$ and $\Gamma_{WW}$ can be approximated as in Section 4.4.

6. Experimental Results 

We now demonstrate our regression approach, which derives from the theoretical results just presented, in two illustrative applications.

6.1 Face Normalization 

Many facial recognition methods rely on a preprocessing stage, coined normalization, which is aimed at 
removing variations that were not observed in the training database. These may include variations due 
to illumination, pose, facial expressions, and more. To demonstrate the utility of our approach, we now 
focus on the problem of producing a neutral expression face from a smiling one. 

A straightforward way of tackling this problem is to learn a regression function from pairs of training images. This requires a database in which each subject appears at least twice, once with a neutral expression and once with a smile. Unfortunately, large data sets of this sort are hard to collect. In many practical situations, one only has access to a database in which each subject appears only once. While different subjects may be wearing different expressions, direct inference of the statistical relation between a smiling and a neutral face is virtually impossible in such scenarios. To bypass this obstacle, we can use a second domain, or view, for which it is easy to obtain examples that are paired with the images in the database. This can be done, for example, by manually marking a set of points in several predefined locations on all images in the database. Thus, denoting by $(X_1,X_2,Y)$ a triplet of a smiling face, its point annotations, and the corresponding neutral expression image, we may construct an unlabeled set of annotated smiling faces $\{x_1^\ell, x_2^\ell\}$ and a set of annotated neutral expression faces $\{x_2^\ell, y^\ell\}$. This allows employing our shared-representation regression technique for designing a predictor of $Y$ based on $X_1$. If, in addition, several subjects were photographed more than once, then we may construct a third set $\{x_1^\ell, y^\ell\}$, containing pairs of images of smiling and neutral-expression faces. In this case, we can apply regression with side information, as discussed in Section 5.3.

Figure 4 depicts several manually annotated neutral and smiling facial images taken from the AR database [12]. The point annotations were taken from http://www-prima.inrialpes.fr/FGnet/data/05-ARFace/tarfd_markup.html. The images were scaled, rotated and cropped
into an ellipsoidal template such that the eyes appear at predefined locations. In practice, this can be 
performed automatically [13, 20]. To apply our methods, we normalized the images to be of zero mean 
and unity norm and reduced them to 86 dimensions using PCA. The nonlinear regression scheme we 
used as a building block in our methods was first-order polynomial regression with a Gaussian kernel. 
The bandwidth of the kernel was adaptively tuned to be a constant times the root of the average squared 
distance between the query and the training data points. 
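The building block described above, first-order (local linear) polynomial regression with a Gaussian kernel and a query-dependent bandwidth, can be sketched as follows. The constant `c` multiplying the root-mean-square distance is a hypothetical value, since the text does not report it, and the function name `local_linear_predict` is ours.

```python
import numpy as np

def local_linear_predict(x_train, y_train, x_query, c=0.5):
    """Local linear regression at a single query point.

    x_train: (n, d) inputs, y_train: (n, m) outputs, x_query: (d,).
    The bandwidth is c times the root of the average squared distance
    between the query and the training points (c is an assumed constant).
    """
    diff = x_train - x_query
    sq_dist = (diff ** 2).sum(axis=1)
    h = c * np.sqrt(sq_dist.mean())
    w = np.exp(-0.5 * sq_dist / h ** 2)
    # Weighted least squares on [1, x - x_query]; the intercept is the prediction.
    design = np.column_stack([np.ones(len(x_train)), diff])
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(sw * design, sw * y_train, rcond=None)
    return beta[0]

# Sanity check on exactly linear data, where a local linear fit is exact.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
M = rng.normal(size=(3, 2))
b = np.array([0.5, -1.0])
Y = X @ M + b
x_q = np.array([0.1, -0.2, 0.3])
pred = local_linear_predict(X, Y, x_q)
```

Centering the design matrix at the query point means no extra evaluation step is needed: the fitted intercept is the regression output at `x_query`.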
Fig. 4: Annotated images from the AR database.

Fig. 5: Neutral expression synthesis from smiling images. From left to right: query, ground truth, direct nonlinear regression, shared-representation nonlinear regression (Section 5.2), linear regression with nonlinear side information (Section 5.3).



Figure 5 demonstrates the results obtained with our approach in several settings. The two left- 
most columns correspond to the query smiling face and the corresponding desired (unobserved) neutral 
expression image. The third column shows the result of directly performing regression using 118 pairs 
of smile/neutral images. The fourth column is the result of performing shared representation regression 
via (5.6), using a training set of 38 annotated smiling faces and a set of 40 annotated neutral images (of 
different subjects). The rightmost column uses, in addition to these two sets, a training set comprising 
40 pairs of images of neutral and smiling expressions to perform linear regression with nonlinear side
information (equation (5.14)). 

Table 1 shows the root MSE (RMSE), $(\mathbb{E}[\|\hat{Y}-Y\|^2])^{1/2}$, attained in each of the settings. As expected, using direct training with 118 examples yields the best results (lowest RMSE). It can be seen that employing two sets with roughly 40 examples each, instead of direct training, leads to an increase in the RMSE by 41%. This gap is reduced to 32% with the aid of an additional set of 40 direct training pairs. Perceptually, the images produced by the indirect methods do not seem to be much worse than those obtained with direct training. Note that the spatial smoothing apparent in all methods is due to the fact that any regression method boils down, in the end, to some sort of averaging of many images from the training set. It is also important to note that the vague traces of glasses in the last two columns are no coincidence. Specifically, when there are no (or very few) joint examples of smile/neutral faces, no method can determine whether the person wears glasses or not. This is because we only know how the smiling images (pixel values) relate to the geometry (point annotations) and how the geometry relates to the neutral images. Now, for every possible geometry, roughly half the people in the neutral database wear glasses and half do not.
Table 1: Performance of Neutral Expression Synthesis Methods

Setting                                              RMSE
Direct nonlinear regression                          0.193
Shared-representation nonlinear regression           0.263
Linear regression with nonlinear side information    0.247

Fig. 6: Processing of the video and audio of a speaker saying the word 'nine'. From left to right: lip detection, spectrogram, extracted lip region.



6.2 Audio-Visual Word Recognition 

Although the entire discussion in this paper has focused on regression, similar methods can be developed 
for classification tasks. To support our claim, we now illustrate that this can even be achieved by 
using the naive approach of performing regression and then quantizing the output in order to obtain a 
classification rule. 

Specifically, we now consider the tasks of spoken digit classification from audio-only and video- 
only measurements. To study this task, we used the Grid Corpus [4], which consists of speakers saying 
simple-structured sentences. Every sentence contains one digit, which we isolated using the supplied 
transcriptions. We constructed three distinct training sets: one of labeled audio examples (4 males, 4 
females), one of visual examples (4 males, 4 females), and one of unlabeled audio-visual examples (6 
males, 4 females). Six speakers were used for testing (3 males, 3 females). 

To process the video, we converted the images to grayscale, used the face detection method of [11], and then applied several mean-shift iterations on the gradient image map in order to extract the lip region in the first image of each frame-bunch. Segments of duration 320 ms were used for recognition. This corresponded to 8 consecutive video frames (at a rate of 25 frames per second) and 1600 audio samples (at a sampling rate of 5 kHz). The image frames were reduced to 10 dimensions using PCA, resulting in an 80-dimensional video feature vector. The audio was processed by computing spectrograms with windows of duration 10 ms and an overlap of 2.5 ms. The dimension of the spectrogram was reduced to 180 to constitute the audio features. In all experiments, $Y$ was a 10-dimensional vector with 1 at the location corresponding to the spoken digit and 0 elsewhere. Figure 6 visualizes the basic audio-visual preprocessing.

Table 2: Audio-Visual Digit Classification Performance

Features                 Accuracy
Training    Testing      Minimax (Grid corpus)    Deep RBM (CUAVE)
Audio       Audio        69.3%                    95.8%
Video       Video        52.0%                    69.7%
Video       Audio        50.1%                    27.5%
Audio       Video        44.6%                    29.4%

As mentioned above, our approach is designed for regression, so that the predicted $\hat{Y}$ is a continuous variable. To perform classification, we chose the maximal element in $\hat{Y}$. For simplicity, $\mathcal{A}$ and $\mathcal{B}$ were taken as the sets of all linear functions (linear regression). This choice yields rather poor classification results based solely on audio or solely on video. Our goal, though, is to demonstrate that even with such naive single-domain predictors, we can attain good recognition accuracy by using our approach, which cleverly fuses the two domains.
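The regress-then-quantize rule used here, one-hot targets followed by picking the largest output, can be sketched as below. This is a generic illustration rather than the experimental pipeline: the linear regressor has no bias term, the features are a toy stand-in for the audio or video features, and the helper names `one_hot`, `fit_linear` and `classify` are ours.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Encode integer labels as vectors with 1 at the label index and 0 elsewhere."""
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def fit_linear(X, Y):
    """Least-squares linear regression of the one-hot targets on the features."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def classify(X, W):
    """Quantize the continuous regression output by taking its maximal element."""
    return np.argmax(X @ W, axis=1)

# Toy check: features that already separate three classes.
labels = np.array([0, 1, 2, 1])
X_toy = one_hot(labels, num_classes=3)
W = fit_linear(X_toy, one_hot(labels, num_classes=3))
predicted = classify(X_toy, W)
```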

Table 2 shows the accuracy of our approach and, for reference, also presents the results obtained
with the deep restricted Boltzmann machine (RBM) of [17] on the CUAVE dataset [19]. The Grid 
corpus used here is more challenging in that the digits appear within sentences, rather than individually. 
As can be seen, the single-domain predictors we start with perform relatively poorly (rows 1 and 2). 
Nevertheless, in the shared-representation settings (rows 3 and 4), our predictors perform much better 
than the RBM method, even for a harder dataset. Their accuracy is only between 7% and 20% worse 
than the corresponding single domain estimators (rows 1 and 2, respectively). By contrast, the difference 
in success rates for the RBM predictor is between 30% and 70%. 

7. Conclusion 

In this paper, we analyzed the problems of multi-domain and single-domain regression in settings involv- 
ing distinct unpaired labeled training sets for the different domains and a large unlabeled set of paired 
examples from all domains. We derived minimax-optimal results and obtained closed form solutions for 
many practical scenarios. We used the resulting expressions to study when training data from a domain, 
which is not available during testing, can help. In particular, we showed that in the setting of cross- 
modality learning, originally presented in [17], there is no advantage in using the training data from 
the unobserved domain, at least from a worst-case perspective. We demonstrated our methods in the 
context of synthesis of a neutral expression face from an image of a smiling subject and in the context 
of audio-visual spoken digit recognition. In the latter setting, we demonstrated that our approach may 
be more effective than that proposed in [17]. This is despite the fact that our method is designed for 
regression rather than classification and even though we applied it on a more challenging audio-visual 
sentence corpus. 



A. Proof of Theorem 4.1 

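We begin by proving claim 1. Since $\mathcal{A}$ is a linear subspace, the orthogonality principle implies that $\varphi_{\mathcal{A}}(X_1)$ is the unique estimator satisfying

$$\mathbb{E}\left[(Y-\varphi_{\mathcal{A}}(X_1))^T\varphi(X_1)\right] = 0 \tag{A.1}$$

for every $\varphi \in \mathcal{A}$. Consequently, for every $\varphi \in \mathcal{A}$ we have that

$$\mathbb{E}\left[Y^T\varphi(X_1)\right] = \mathbb{E}\left[\varphi_{\mathcal{A}}(X_1)^T\varphi(X_1)\right]. \tag{A.2}$$

Similarly, for every $\psi \in \mathcal{B}$ we have that

$$\mathbb{E}\left[Y^T\psi(X_2)\right] = \mathbb{E}\left[\psi_{\mathcal{B}}(X_2)^T\psi(X_2)\right]. \tag{A.3}$$

Finally, as $\mathcal{C} = \mathcal{A} + \mathcal{B}$, the set $\mathcal{C}$ is a subspace as well. Therefore, $\rho_{\mathcal{C}}$ of (4.3) is the unique estimator satisfying

$$\mathbb{E}\left[Y^T(\varphi(X_1)+\psi(X_2))\right] = \mathbb{E}\left[\rho_{\mathcal{C}}(X_1,X_2)^T(\varphi(X_1)+\psi(X_2))\right] \tag{A.4}$$

for every $\varphi \in \mathcal{A}$ and $\psi \in \mathcal{B}$. Substituting (A.2) and (A.3), condition (A.4) reduces to the requirement that

$$\mathbb{E}\left[\varphi_{\mathcal{A}}(X_1)^T\varphi(X_1)\right] + \mathbb{E}\left[\psi_{\mathcal{B}}(X_2)^T\psi(X_2)\right] = \mathbb{E}\left[\rho_{\mathcal{C}}(X_1,X_2)^T(\varphi(X_1)+\psi(X_2))\right] \tag{A.5}$$

for every $\varphi \in \mathcal{A}$ and $\psi \in \mathcal{B}$. Now, the $\mathcal{A}$- and $\mathcal{B}$-optimal estimators of $Y$ from $X_1$ and $X_2$ are fixed over $\mathcal{F}$ (given by $\varphi_{\mathcal{A}}$ and $\psi_{\mathcal{B}}$, respectively). Furthermore, all expectations in (A.5) are with respect to $F_{X_1X_2}$, which is also fixed over $\mathcal{F}$. This implies that the function $\rho_{\mathcal{C}}$ does not depend on the choice of $F_{X_1X_2Y} \in \mathcal{F}$, completing the proof of claim 1.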

To prove claim 2, we note that from the orthogonality principle (A.4) follows the Pythagorean relation

$$\mathbb{E}\left[\|Y-\rho_{\mathcal{C}}(X_1,X_2)\|^2\right] = \mathbb{E}\left[\|Y\|^2\right] - \mathbb{E}\left[\|\rho_{\mathcal{C}}(X_1,X_2)\|^2\right]. \tag{A.6}$$

The first term on the right-hand side equals $c$ for every $F_{X_1X_2Y} \in \mathcal{F}$. We have also seen that $\rho_{\mathcal{C}}(X_1,X_2)$ is fixed over $\mathcal{F}$. Moreover, the expectation in the second term is with respect to $F_{X_1X_2}$, which is fixed over $\mathcal{F}$. Therefore, the second term, as well, does not depend on the choice of $F_{X_1X_2Y} \in \mathcal{F}$. This completes the proof of claim 2.

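Lastly, we prove claim 3. To do so, we first note that $\varphi_{\mathcal{A}}(X_1)$ and $\psi_{\mathcal{B}}(X_2)$ are not only the $\mathcal{A}$- and $\mathcal{B}$-optimal estimators of $Y$ based on $X_1$ and $X_2$, respectively; they are also the $\mathcal{A}$- and $\mathcal{B}$-optimal estimators of $\rho_{\mathcal{C}}(X_1,X_2)$. To see this, note that both $\mathcal{A}$ and $\mathcal{B}$ are contained in $\mathcal{C}$. Consequently, the orthogonality principle implies that for every $\varphi \in \mathcal{A}$ (which is also in $\mathcal{C}$), we have

$$\mathbb{E}\left[\|Y-\varphi(X_1)\|^2\right] = \mathbb{E}\left[\|Y-\rho_{\mathcal{C}}(X_1,X_2)\|^2\right] + \mathbb{E}\left[\|\rho_{\mathcal{C}}(X_1,X_2)-\varphi(X_1)\|^2\right]. \tag{A.7}$$

As the first term does not depend on $\varphi$, we see that minimization of the MSE over $\varphi \in \mathcal{A}$ is equivalent to minimization of the second term alone. Thus, $\varphi_{\mathcal{A}}(X_1)$ is the $\mathcal{A}$-optimal estimate of $\rho_{\mathcal{C}}(X_1,X_2)$ given $X_1$. The same argument can be invoked to deduce that $\psi_{\mathcal{B}}(X_2)$ is the $\mathcal{B}$-optimal estimate of $\rho_{\mathcal{C}}(X_1,X_2)$ from $X_2$.

A second observation we need for proving claim 3 follows from the fact that $\mathcal{A}$ and $\mathcal{B}$ are linear subspaces. Specifically, this implies that if $\varphi_1^*(V)$ and $\varphi_2^*(V)$ are the $\mathcal{A}$-optimal estimates of the two RVs $W_1$ and $W_2$, respectively, based on the RV $V$, then the $\mathcal{A}$-optimal estimate of $W_1 + W_2$ is $\varphi_1^*(V) + \varphi_2^*(V)$. This can be seen by noting that the estimator $\varphi_1^*(V) + \varphi_2^*(V)$ satisfies the orthogonality principle, namely for any $\varphi \in \mathcal{A}$ we have that

$$\mathbb{E}\left[(W_1+W_2-\varphi_1^*(V)-\varphi_2^*(V))^T\varphi(V)\right] = \mathbb{E}\left[(W_1-\varphi_1^*(V))^T\varphi(V)\right] + \mathbb{E}\left[(W_2-\varphi_2^*(V))^T\varphi(V)\right] = 0. \tag{A.8}$$

The statement also holds, of course, with respect to $\mathcal{B}$-optimal estimates.

Following these two observations, for any $F_{X_1X_2Y} \in \mathcal{F}$, setting $\tilde{Y} = 2\rho_{\mathcal{C}}(X_1,X_2) - Y$ results in a distribution $F_{X_1X_2\tilde{Y}}$ that also belongs to $\mathcal{F}$. This is because the $\mathcal{A}$-optimal estimate of $\tilde{Y}$ from $X_1$ equals twice the $\mathcal{A}$-optimal estimate of $\rho_{\mathcal{C}}(X_1,X_2)$ from $X_1$ (which is $\varphi_{\mathcal{A}}(X_1)$) minus the $\mathcal{A}$-optimal estimate of $Y$ from $X_1$ (which is also $\varphi_{\mathcal{A}}(X_1)$). Namely, the $\mathcal{A}$-optimal estimate of $\tilde{Y}$ from $X_1$ is $\varphi_{\mathcal{A}}(X_1)$. Similarly, the $\mathcal{B}$-optimal estimate of $\tilde{Y}$ from $X_2$ is $\psi_{\mathcal{B}}(X_2)$. Finally, due to the orthogonality principle, the second-order moment of $\tilde{Y}$ is given by

$$\mathbb{E}\left[\|\tilde{Y}\|^2\right] = \mathbb{E}\left[\|\rho_{\mathcal{C}}(X_1,X_2)\|^2\right] + \mathbb{E}\left[\|Y-\rho_{\mathcal{C}}(X_1,X_2)\|^2\right] = \mathbb{E}\left[\|\rho_{\mathcal{C}}(X_1,X_2)\|^2\right] + \mathbb{E}\left[\|Y\|^2\right] - \mathbb{E}\left[\|\rho_{\mathcal{C}}(X_1,X_2)\|^2\right] = c. \tag{A.9}$$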

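We now use this fact to prove claim 3. The orthogonality principle (A.4) implies that the MSE attained by any estimator $\rho$ satisfies

$$\begin{aligned}
\mathbb{E}\left[\|Y-\rho(X_1,X_2)\|^2\right] &= \mathbb{E}\left[\|Y-\rho_{\mathcal{C}}(X_1,X_2)\|^2\right] + \mathbb{E}\left[\|\rho_{\mathcal{C}}(X_1,X_2)-\rho(X_1,X_2)\|^2\right] \\
&\quad + 2\,\mathbb{E}\left[(Y-\rho_{\mathcal{C}}(X_1,X_2))^T(\rho_{\mathcal{C}}(X_1,X_2)-\rho(X_1,X_2))\right] \\
&= \mathbb{E}\left[\|Y-\rho_{\mathcal{C}}(X_1,X_2)\|^2\right] + \mathbb{E}\left[\|\rho_{\mathcal{C}}(X_1,X_2)-\rho(X_1,X_2)\|^2\right] \\
&\quad + 2\,\mathbb{E}\left[(\rho_{\mathcal{C}}(X_1,X_2)-Y)^T\rho(X_1,X_2)\right].
\end{aligned} \tag{A.10}$$

The first term in this expression is not a function of $\rho$ and, as we have seen in (A.6), is constant as a function of $F_{X_1X_2Y}$ over $\mathcal{F}$. The second term is a function of $\rho$, but since the expectation is with respect to $F_{X_1X_2}$, it is constant as a function of $F_{X_1X_2Y}$ over $\mathcal{F}$. Therefore,

$$\min_{\rho}\sup_{F_{X_1X_2Y}\in\mathcal{F}} \mathrm{MSE}(F_{X_1X_2Y},\rho) = \mathbb{E}\left[\|Y-\rho_{\mathcal{C}}(X_1,X_2)\|^2\right] + \min_{\rho}\left\{\mathbb{E}\left[\|\rho_{\mathcal{C}}(X_1,X_2)-\rho(X_1,X_2)\|^2\right] + \sup_{F_{X_1X_2Y}\in\mathcal{F}} 2\,\mathbb{E}\left[(\rho_{\mathcal{C}}(X_1,X_2)-Y)^T\rho(X_1,X_2)\right]\right\}. \tag{A.11}$$

We saw that for every $F_{X_1X_2Y} \in \mathcal{F}$, setting $\tilde{Y} = 2\rho_{\mathcal{C}}(X_1,X_2) - Y$ results in a distribution $F_{X_1X_2\tilde{Y}}$ that also belongs to $\mathcal{F}$. Now, with $F_{X_1X_2\tilde{Y}}$, the expression $2\,\mathbb{E}[(\rho_{\mathcal{C}}(X_1,X_2)-\tilde{Y})^T\rho(X_1,X_2)]$ equals $-2\,\mathbb{E}[(\rho_{\mathcal{C}}(X_1,X_2)-Y)^T\rho(X_1,X_2)]$. Consequently, the maximum of this term over $F_{X_1X_2Y} \in \mathcal{F}$ is necessarily nonnegative. We thus have that

$$\min_{\rho}\sup_{F_{X_1X_2Y}\in\mathcal{F}} \mathrm{MSE}(F_{X_1X_2Y},\rho) \geq \mathbb{E}\left[\|Y-\rho_{\mathcal{C}}(X_1,X_2)\|^2\right] + \min_{\rho}\mathbb{E}\left[\|\rho_{\mathcal{C}}(X_1,X_2)-\rho(X_1,X_2)\|^2\right] = \mathbb{E}\left[\|Y-\rho_{\mathcal{C}}(X_1,X_2)\|^2\right], \tag{A.12}$$

where we used the fact that the minimal value of 0 is attained with $\rho(X_1,X_2) = \rho_{\mathcal{C}}(X_1,X_2)$.

We have established a lower bound on the worst-case MSE of any estimator. Next, we show that the estimator $\rho(X_1,X_2) = \rho_{\mathcal{C}}(X_1,X_2)$ attains this bound, which proves that it is minimax-optimal. Indeed, substituting this solution into (A.10), we find that

$$\sup_{F_{X_1X_2Y}\in\mathcal{F}} \mathrm{MSE}(F_{X_1X_2Y},\rho_{\mathcal{C}}) = \mathbb{E}\left[\|Y-\rho_{\mathcal{C}}(X_1,X_2)\|^2\right], \tag{A.13}$$

completing the proof.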
B. Proof of Theorems 4.2 and 5.1

We simultaneously prove Theorems 4.2 and 5.1 by using an auxiliary RV $Z$, which can be any (fixed) function of $X_1$ and $X_2$. Therewith, we will study the solution to

$$\arg\min_{\rho}\sup_{F_{X_1X_2Y}\in\mathcal{F}} \mathrm{REG}(F_{X_1X_2Y},\rho), \tag{A.1}$$

where minimization is performed over all functions $\rho$ of $Z$ and the regret is with respect to $\mathbb{E}[Y|Z]$. Specifically, we will show that the solution to this problem is given by $\mathbb{E}[\rho_M(X_1,X_2)|Z]$. Setting $Z = (X_1^T, X_2^T)^T$, we get $\mathbb{E}[\rho_M(X_1,X_2)|Z] = \rho_M(X_1,X_2)$, proving Theorem 4.2. Setting $Z = X_1$, the solution becomes $\mathbb{E}[\rho_M(X_1,X_2)|X_1]$, proving Theorem 5.1.

Expressing $Y = \rho_M(X_1,X_2) + (Y - \rho_M(X_1,X_2))$, the regret of any estimator $\rho(Z)$ can be written as

$$\begin{aligned}
\mathbb{E}\left[\|\mathbb{E}[Y|Z]-\rho(Z)\|^2\right] &= \mathbb{E}\left[\|\mathbb{E}[\rho_M(X_1,X_2)|Z]-\rho(Z)\|^2\right] + \mathbb{E}\left[\|\mathbb{E}[Y-\rho_M(X_1,X_2)|Z]\|^2\right] \\
&\quad + 2\,\mathbb{E}\left[\mathbb{E}[Y-\rho_M(X_1,X_2)|Z]^T\left(\mathbb{E}[\rho_M(X_1,X_2)|Z]-\rho(Z)\right)\right].
\end{aligned} \tag{A.2}$$

Since the marginal distribution $F_{X_1X_2}$ is fixed over $\mathcal{F}$, the first term in the above expression does not depend on the choice of $F_{X_1X_2Y} \in \mathcal{F}$. Consequently,

$$\sup_{F_{X_1X_2Y}\in\mathcal{F}} \mathrm{REG}(F_{X_1X_2Y},\rho) = \mathbb{E}\left[\|\mathbb{E}[\rho_M(X_1,X_2)|Z]-\rho(Z)\|^2\right] + \sup_{F_{X_1X_2Y}\in\mathcal{F}}\left\{\mathbb{E}\left[\|\mathbb{E}[Y-\rho_M(X_1,X_2)|Z]\|^2\right] + 2\,\mathbb{E}\left[\mathbb{E}[Y-\rho_M(X_1,X_2)|Z]^T\left(\mathbb{E}[\rho_M(X_1,X_2)|Z]-\rho(Z)\right)\right]\right\}. \tag{A.3}$$

As we have seen in Appendix A, for every $F_{X_1X_2Y} \in \mathcal{F}$, setting $\tilde{Y} = 2\rho_M(X_1,X_2) - Y$ results in a distribution $F_{X_1X_2\tilde{Y}}$ that also belongs to $\mathcal{F}$. Now, $\tilde{Y} - \rho_M(X_1,X_2) = -(Y - \rho_M(X_1,X_2))$, implying that if $F_{X_1X_2Y}$ maximizes the first term within the braces, then either $F_{X_1X_2Y}$ or $F_{X_1X_2\tilde{Y}}$ yields at least the same value for the objective comprising both terms. Therefore,

$$\min_{\rho}\sup_{F_{X_1X_2Y}\in\mathcal{F}} \mathrm{REG}(F_{X_1X_2Y},\rho) \geq \min_{\rho}\mathbb{E}\left[\|\mathbb{E}[\rho_M(X_1,X_2)|Z]-\rho(Z)\|^2\right] + \sup_{F_{X_1X_2Y}\in\mathcal{F}}\mathbb{E}\left[\|\mathbb{E}[Y-\rho_M(X_1,X_2)|Z]\|^2\right] = \sup_{F_{X_1X_2Y}\in\mathcal{F}}\mathbb{E}\left[\|\mathbb{E}[Y-\rho_M(X_1,X_2)|Z]\|^2\right], \tag{A.4}$$

where the last equality is due to the fact that $\rho(Z) = \mathbb{E}[\rho_M(X_1,X_2)|Z]$ achieves the minimal value of 0 in the first term.

We established a lower bound on the worst-case regret of any estimator. Next, we show that the estimator $\rho^*(Z) = \mathbb{E}[\rho_M(X_1,X_2)|Z]$ attains this bound, which proves that it is minimax-optimal. Indeed, substituting this solution into (A.3), we find that

$$\sup_{F_{X_1X_2Y}\in\mathcal{F}} \mathrm{REG}(F_{X_1X_2Y},\rho^*) = \sup_{F_{X_1X_2Y}\in\mathcal{F}}\mathbb{E}\left[\|\mathbb{E}[Y-\rho_M(X_1,X_2)|Z]\|^2\right], \tag{A.5}$$

completing the proof.
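C. Proof of Theorem 4.4

To prove the claim, we show that the estimation error corresponding to $\rho_{\mathcal{C}}(X_1,X_2)$ of (4.18) is uncorrelated with every RV of the form $\varphi(X_1) + \psi(X_2)$ with $\varphi \in \mathcal{A}$ and $\psi \in \mathcal{B}$. Indeed, for every $\varphi \in \mathcal{A}$, the estimator $\rho_{\mathcal{C}}(X_1,X_2)$ of (4.18) satisfies

$$\mathbb{E}\left[(Y-\rho_{\mathcal{C}}(X_1,X_2))^T\varphi(X_1)\right] = \mathbb{E}\left[(Y-\varphi_{\mathcal{A}}(X_1))^T\varphi(X_1)\right] - \mathbb{E}\left[\rho_{\mathcal{A},\mathcal{B}}(X_1,X_2)^T\varphi(X_1)\right] = -\mathbb{E}\left[(\tilde{\psi}(X_2)-\eta_{\tilde{\psi}}(X_1))^T\varphi(X_1)\right] = 0, \tag{A.1}$$

where $\tilde{\psi} \in \mathcal{B}$ denotes the function for which $\rho_{\mathcal{A},\mathcal{B}}(X_1,X_2) = \tilde{\psi}(X_2) - \eta_{\tilde{\psi}}(X_1)$, and where we used the orthogonality principle. To prove orthogonality with respect to RVs of the form $\psi(X_2)$, with $\psi \in \mathcal{B}$, we write $\psi(X_2) = \psi(X_2) - \eta_\psi(X_1) + \eta_\psi(X_1)$, where $\eta_\psi(X_1)$ is the $\mathcal{A}$-optimal estimate of $\psi(X_2)$ based on $X_1$. By the orthogonality principle, the errors $Y - \varphi_{\mathcal{A}}(X_1)$ and $\rho_{\mathcal{A},\mathcal{B}}(X_1,X_2) = \tilde{\psi}(X_2) - \eta_{\tilde{\psi}}(X_1)$ are uncorrelated with any RV $\eta(X_1)$, where $\eta \in \mathcal{A}$, and thus in particular with the term $\eta_\psi(X_1)$. Therefore, we have that

$$\begin{aligned}
\mathbb{E}\left[(Y-\rho_{\mathcal{C}}(X_1,X_2))^T\psi(X_2)\right] &= \mathbb{E}\left[(Y-\varphi_{\mathcal{A}}(X_1)-\rho_{\mathcal{A},\mathcal{B}}(X_1,X_2))^T(\psi(X_2)-\eta_\psi(X_1))\right] \\
&= \mathbb{E}\left[(Y-\rho_{\mathcal{A},\mathcal{B}}(X_1,X_2))^T(\psi(X_2)-\eta_\psi(X_1))\right] \\
&= 0.
\end{aligned} \tag{A.2}$$

Here, the second equality results from the fact that the term $\psi(X_2) - \eta_\psi(X_1)$ is orthogonal to every RV $\varphi(X_1)$, where $\varphi \in \mathcal{A}$ and, in particular, to $\varphi_{\mathcal{A}}(X_1)$. The third equality follows from the fact that $\rho_{\mathcal{A},\mathcal{B}}(X_1,X_2)$ is the MMSE estimate of $Y$ among all functions of the form $\psi(X_2) - \eta_\psi(X_1)$, with $\psi$ being some function in $\mathcal{B}$ and $\eta_\psi(X_1)$ being the $\mathcal{A}$-optimal estimator of $\psi(X_2)$ from $X_1$. Consequently, the error $Y - \rho_{\mathcal{A},\mathcal{B}}(X_1,X_2)$ is orthogonal to every RV of this form and, in particular, to the term $\psi(X_2) - \eta_\psi(X_1)$ appearing in (A.2).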